Hallo all.
While writing the hub tests, I have come across a problem with the definition of the SAMP string data type. Section 3.3 of the SAMP doc defines a string as:
"a scalar value consisting of a sequence of characters;
each character may be in the range 0x01-0x7f"
Section 2.2 of the XML specification meanwhile
(http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) has the following
BNF production for characters allowed in an XML document:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
| [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the
surrogate blocks, FFFE, and FFFF. */
(I do not understand the comment here - as far as I can see Unicode
does include the other control characters in the range #x0-#x1f.
Oh well).
What this means is that there are legal SAMP strings (ones containing any character in the ranges 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F) which cannot be transmitted as an XML-RPC <string> element. This means that either the definition of a SAMP string, or the prescription for transmitting SAMP strings in XML-RPC messages in the Standard Profile, must be modified to avoid inconsistency.
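To make the mismatch concrete, here is a rough Python sketch that lists the code points which are legal under the current sec 3.3 wording but fall outside the XML Char production quoted above. The function and variable names (is_xml_char, is_samp_char, problem_chars, parses_as_xml_string) are made up for illustration and are not part of SAMP or of any hub implementation; the parser check uses Python's standard xml.etree.ElementTree and does no escaping of markup characters.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp: int) -> bool:
    """True if code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_samp_char(cp: int) -> bool:
    """True if cp is allowed by the current SAMP sec 3.3 wording (0x01-0x7f)."""
    return 0x01 <= cp <= 0x7F


# Code points that are legal in a SAMP string but not in an XML document:
# 0x01-0x08, 0x0B, 0x0C and 0x0E-0x1F, i.e. 28 characters in total.
problem_chars = [
    cp for cp in range(0x80) if is_samp_char(cp) and not is_xml_char(cp)
]


def parses_as_xml_string(text: str) -> bool:
    """Try to parse text wrapped in a <string> element (no escaping done)."""
    try:
        ET.fromstring("<string>" + text + "</string>")
        return True
    except ET.ParseError:
        return False
```

For example, parses_as_xml_string("hello") succeeds, while a string containing 0x01 is rejected by the parser as not well-formed.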
I think the possibilities are as follows:
Both (1) and (2) would entail significant extra complication
(base64 decoding required) for Standard Profile clients, and (2) would
additionally make debugging harder (it's nice that you can see what's
in a SAMP/XML-RPC message just by looking). (3) would make life a bit
more complicated than now for clients, but not that much. The existing
legal range 0x01-0x7f for SAMP string characters was in any case just
intended to be a range of characters which would be sufficient for
'normal' strings, while excluding non-printable ones (i.e. ones which
would likely cause problems for some transport types), and it looks
like I decided on a range that was too wide for that purpose.
So I suggest that we do (4). I think we do need at least one line-break character, though the need for both 0x0A and 0x0D is debatable, as is the need for 0x09 (tab). So I suggest that we change the definition of a SAMP string in sec 3.3 to one of:
4a. "a scalar value consisting of a sequence of characters;
each character may be in the range 0x20-0x7f or one of
the special characters 0x09 (tab), 0x0A (line feed) or
0x0D (carriage return)"
or
4b. "a scalar value consisting of a sequence of characters;
each character may be in the range 0x20-0x7f or the
line break character 0x0a"
(4b) might be more rigorous since it obviates the possibility of
confusion when transforming between OSs (Windows and *nix), but
since SAMP usage will probably mostly be intra-OS this might cause
more trouble than it's worth - also, I bet that Windows-based
implementations would routinely violate this in any case
(see Goldfarb's First Law of Text Processing) so probably 4a is
better.
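For anyone who wants to try the two candidate definitions against real strings, here is a minimal Python sketch of a validity check for each. The names SAMP_4A, SAMP_4B, is_valid_4a and is_valid_4b are invented for this illustration; the character classes are taken directly from the 4a and 4b wordings above (note that both still admit 0x7f).

```python
import re

# 4a: 0x20-0x7f plus tab (0x09), line feed (0x0A) and carriage return (0x0D).
SAMP_4A = re.compile(r"[\x20-\x7f\x09\x0a\x0d]*")

# 4b: 0x20-0x7f plus line feed (0x0A) only.
SAMP_4B = re.compile(r"[\x20-\x7f\x0a]*")


def is_valid_4a(s: str) -> bool:
    """True if every character of s is allowed under proposed definition 4a."""
    return SAMP_4A.fullmatch(s) is not None


def is_valid_4b(s: str) -> bool:
    """True if every character of s is allowed under proposed definition 4b."""
    return SAMP_4B.fullmatch(s) is not None
```

A string with Windows-style line endings such as "a\r\nb" passes is_valid_4a but fails is_valid_4b, which is exactly the case I expect Windows-based implementations to trip over.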
Comments/agreements/disagreements?
Mark
--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor@bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/

Received on 2008-08-01Z12:02:40