String character range

From: Mark Taylor <m.b.taylor-at-bristol.ac.uk>
Date: Fri, 1 Aug 2008 11:02:32 +0100 (BST)


Hallo all.

While writing the hub tests, I have come across a problem with the definition of the SAMP string data type. Section 3.3 of the SAMP doc defines a string as:

     "a scalar value consisting of a sequence of characters;
      each character may be in the range 0x01-0x7f"

Section 2.2 of the XML specification meanwhile
(http://www.w3.org/TR/2006/REC-xml-20060816/#charsets) has the following
BNF production for characters allowed in an XML document:

    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
                 /* any Unicode character, excluding the
                    surrogate blocks, FFFE, and FFFF. */

(I do not understand the comment here - as far as I can see Unicode
does include the other control characters in the range #x0-#x1f. Oh well).

What this means is that there are legal SAMP strings (ones containing any character in the ranges 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F) which cannot be transmitted as an XML-RPC <string> element. This means that either the definition of a SAMP string, or the prescription for transmitting SAMP strings in XML-RPC messages in the Standard Profile, must be modified to avoid inconsistency.
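The mismatch can be checked mechanically. The following is a minimal sketch, not part of either specification: the function names are made up for illustration, and the two predicates simply transcribe the SAMP sec 3.3 range and the XML Char production quoted above.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches production [2] Char quoted above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_samp_char(cp: int) -> bool:
    """True if cp is in the SAMP string range 0x01-0x7f (sec 3.3)."""
    return 0x01 <= cp <= 0x7F


# Code points that are legal in a SAMP string but not in an XML document.
problem_chars = [
    cp for cp in range(0x80) if is_samp_char(cp) and not is_xml_char(cp)
]

print(" ".join(f"0x{cp:02X}" for cp in problem_chars))
```

The printed list covers exactly the ranges 0x01-0x08, 0x0B, 0x0C and 0x0E-0x1F.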

I think the possibilities are as follows:

  1. Encode all SAMP strings as <base64> elements when transmitting over XML-RPC.
  2. Allow SAMP strings to be transmitted as either <string> or <base64> elements when transmitting over XML-RPC (the latter case being required only if the string contains un-XML characters).
  3. Define some escaping convention for un-XML characters, e.g. \u001f for character 31.
  4. Change the SAMP string definition so that only XML-friendly characters are allowed.

Both (1) and (2) would entail significant extra complication (base64 decoding required) for Standard Profile clients, and (2) would additionally make debugging harder (it's nice that you can see what's in a SAMP/XML-RPC message just by looking). (3) would make life a bit more complicated than now for clients, but not that much.

The existing legal range 0x01-0x7f for SAMP string characters was in any case just intended to be a range of characters which would be sufficient for 'normal' strings, while excluding non-printable ones (i.e. ones which would likely cause problems for some transport types), and it looks like I decided on a range that was too wide for that purpose.

So I suggest that we do (4). I think we do need at least one line-break character, though the need for both 0x0A and 0x0D may be moot, as is the need for 0x09 (tab). So I suggest that we change the definition of a SAMP string in sec 3.3 to one of:

   4a. "a scalar value consisting of a sequence of characters;
        each character may be in the range 0x20-0x7f or one of
        the special characters 0x09 (tab), 0x0A (line feed) or
        0x0D (carriage return)"

or

   4b. "a scalar value consisting of a sequence of characters;
        each character may be in the range 0x20-0x7f or the
        line break character 0x0a"

(4b) might be more rigorous, since it obviates the possibility of confusion when transforming between OSs (Windows and *nix). But since SAMP usage will probably mostly be intra-OS, this might cause more trouble than it's worth. Also, I bet that Windows-based implementations would routinely violate this in any case (see Goldfarb's First Law of Text Processing), so 4a is probably better.
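The practical difference between the two wordings shows up with Windows-style line endings. A minimal sketch of both checks follows; the function names are hypothetical and the rules are transcribed directly from 4a and 4b above.

```python
def is_valid_4a(s: str) -> bool:
    """Proposal 4a: 0x20-0x7f plus tab, line feed and carriage return."""
    return all(0x20 <= ord(c) <= 0x7F or c in "\t\n\r" for c in s)


def is_valid_4b(s: str) -> bool:
    """Proposal 4b: 0x20-0x7f plus line feed only."""
    return all(0x20 <= ord(c) <= 0x7F or c == "\n" for c in s)


unix_text = "line one\nline two"
windows_text = "line one\r\nline two"

print(is_valid_4a(unix_text), is_valid_4b(unix_text))
print(is_valid_4a(windows_text), is_valid_4b(windows_text))
```

Under 4b the CRLF-terminated text is rejected, so a Windows client would have to normalise line endings before sending; under 4a both forms pass unchanged.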

Comments/agreements/disagreements?

Mark

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor@bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
Received on 2008-08-01Z12:02:40