Re: String character range

From: Mark Taylor <m.b.taylor-at-bristol.ac.uk>
Date: Mon, 4 Aug 2008 11:09:16 +0100 (BST)


On Fri, 1 Aug 2008, Luigi Paioro wrote:

> Hi.
>
> I find that your suggestion below is a good compromise. I would split it in
> two points:
>
> 1. At SAMP protocol definition level we might define that "string" can accept
> any sequence of 0x01-0x7f characters adding the escape convention for any
> printable Unicode char out of the specified range (so it is general).
>
> 2. At Standard Profile level I would put more constraints, limiting the
> charset to the XML range and introducing the escape convention for the other
> unsupported chars.
>
> Is it reasonable?

Luigi,

that is a reasonable way to go for permitting transmission of Unicode characters. However, any kind of escaping does introduce a fair amount of fiddly complication to handle all cases, both in the standard and at the client end.

In the standard we have to say exactly what counts as a Unicode escape and which characters it is permitted or required for, and we have to make sure that there is some mechanism for escaping the escape. For instance, if you want to send a string that looks like the six ASCII characters "\u001f" rather than the single Unicode character at code point 0x1f, there has to be a way of doing that which will not be misunderstood.
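To make the "escaping the escape" problem concrete, here is a minimal Python sketch of one possible convention. Everything in it is an assumption made for illustration, not something from any SAMP draft: "\uXXXX" (four hex digits) stands for a code point, "\\" stands for one literal backslash, and the function names are hypothetical.

```python
import re

# Hypothetical convention (an assumption, not from any SAMP draft):
#   \uXXXX  -> the Unicode character with hex code point XXXX
#   \\      -> one literal backslash
_ESCAPE = re.compile(r"\\(\\|u[0-9a-fA-F]{4})")


def encode_escapes(text: str) -> str:
    """Turn arbitrary text into the escaped, ASCII-only wire form."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")            # escape the escape character itself
        elif cp in (0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7E:
            out.append(ch)                # sent as-is
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)    # control or non-ASCII character
        else:
            # Code points above 0xFFFF would need a longer escape form;
            # this sketch deliberately leaves that undecided.
            raise ValueError("code point beyond U+FFFF: %r" % ch)
    return "".join(out)


def decode_escapes(wire: str) -> str:
    """Reverse encode_escapes(); a stray lone backslash is left unchanged."""
    def replace(match):
        body = match.group(1)
        if body == "\\":
            return "\\"
        return chr(int(body[1:], 16))
    return _ESCAPE.sub(replace, wire)
```

With this convention the control character 0x1f goes on the wire as \u001f, while the literal six-character text \u001f goes on the wire as \\u001f, so the two cannot be confused.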

At the client end, for reading strings at least, implementors will have to take account of all of these things in order to decode a string acquired from the SAMP transport (XML-RPC in the case of the Standard Profile). That is not hard in Unicode-aware languages which use the same escaping mechanism for Unicode characters as SAMP would (Java, Python); not too hard in languages designed for text manipulation (Perl); but probably quite a drag in languages which fall into neither category (C, FORTRAN, IDL) - I'd guess at least 10-20 lines of code just for string decoding. In many cases client implementations would quite likely treat the string as normal ASCII, which would work 99% of the time and behave incorrectly, in mostly-not-very-catastrophic ways, the other 1%. Of course, the best that languages with no Unicode support can do if they encounter non-ASCII Unicode characters is probably to replace them with a "?" or something.
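The "?" fallback mentioned above is a one-liner in a Unicode-aware language. A small Python illustration, using a hypothetical helper name, of what a client limited to ASCII might do with an already-decoded string:

```python
def to_ascii_lossy(text: str) -> str:
    """Replace every non-ASCII character with "?" (hypothetical helper)."""
    # errors="replace" substitutes "?" for each character that has no
    # ASCII encoding; ASCII characters pass through unchanged.
    return text.encode("ascii", errors="replace").decode("ascii")
```

This loses information, of course, which is exactly the mostly-harmless misbehaviour described above.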

If we reckon that transmission of (a) control characters (everything between 0x01 and 0x1f) and (b) non-7-bit-ASCII characters (Unicode beyond 0x7f) is a requirement for what we're doing here, OK, let's draft a revised definition of the SAMP string data type which is capable of doing all this and clients will have to do the extra work if they want to behave correctly.

My feeling is it would be better to restrict what can be sent in a SAMP string to something that is going to be easy to implement in all sensible languages/transports (probably 0x09, 0x0a, 0x0d, 0x20-0x7f), so that both the standard, and the requirements on clients, stay as simple as possible. If specific requirements for sending full Unicode strings arise, we could mark these on a per-MType basis and come up with a convention along the lines of the SAMP int and SAMP float already defined in Section 3.4.
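For comparison, checking a string against the restricted character set suggested above takes only a few lines. A minimal Python sketch; the names ALLOWED_CODES and is_restricted_samp_string are made up for this example, and the set of codes is simply the one listed in the paragraph above:

```python
# Codes proposed above: tab, line feed, carriage return, and 0x20-0x7f.
ALLOWED_CODES = frozenset([0x09, 0x0A, 0x0D]) | frozenset(range(0x20, 0x80))


def is_restricted_samp_string(text: str) -> bool:
    """Return True if every character falls in the proposed restricted range."""
    return all(ord(ch) in ALLOWED_CODES for ch in text)
```

A sender could run such a check before transmission, and no decoding step would be needed on the receiving side.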

Which of these is best depends on how important the requirement to be able to send Unicode and control characters is. My vote is not very. Can we have a show of hands?

-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor@bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
Received on 2008-08-04Z12:09:19