Re: String character range

From: Doug Tody <dtody-at-nrao.edu>
Date: Fri, 1 Aug 2008 14:53:06 -0600 (MDT)


Hey Mark -

I agree with your sentiment that string data which we want to manipulate in any language or environment should be simple; if necessary, a separate datatype could be declared for representing, e.g., general Unicode-encoded text.

What about UTF-8 though? This is backwards compatible with ASCII but allows any Unicode character to be represented using multi-byte sequences - if there are no non-ASCII characters, the encoded bytes are identical to ASCII. This is much like your escape sequence proposal, but is a widely used standard. XML processors are required to support UTF-8 (almost any XML document one sees is UTF-8 encoded), so there should be no problems there.
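To make the ASCII compatibility concrete, here is a minimal Python sketch (Python is chosen only for illustration; the sample strings and variable names are assumptions, not anything from SAMP). It checks that ASCII-only text encodes to the same bytes under both encodings, and shows the two-byte UTF-8 form of code point 0xf1 (the character discussed in the quoted message below).

```python
# ASCII-only text: the UTF-8 bytes are identical to the ASCII bytes.
plain = "Carlos Rodrigo"  # illustrative sample string
same_as_ascii = plain.encode("utf-8") == plain.encode("ascii")
print(same_as_ascii)  # True

# Code point 0xf1 (n with tilde) is outside 7-bit ASCII, so UTF-8
# represents it as a multi-byte sequence.
n_tilde = "\u00f1"
encoded = n_tilde.encode("utf-8")
print(encoded)       # b'\xc3\xb1'
print(len(encoded))  # 2
```

Plain ASCII strings are unaffected byte for byte; only characters above code point 0x7f expand to more than one byte.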

I suspect that if some old ASCII-oriented code got a UTF-8 encoded string containing multi-byte Unicode characters, it would print these oddly; however, it would probably still work (things like the null test for end of string still work normally for UTF-8, since no byte of a multi-byte sequence is zero). There would be no problem for the usual case of simple ASCII text.
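A rough Python sketch of what such old code would see, under stated assumptions: `c_strlen` is a hypothetical stand-in for a C-style end-of-string scan, and decoding as Latin-1 stands in for a program that treats every byte as one 8-bit character. Neither is part of any SAMP implementation.

```python
def c_strlen(buf: bytes) -> int:
    """Count bytes before the first NUL, as a C-style strlen would."""
    n = 0
    while n < len(buf) and buf[n] != 0:
        n += 1
    return n


name = "Mu\u00f1oz"                     # illustrative author name
wire = name.encode("utf-8") + b"\x00"   # NUL-terminated, as C code expects

# The scan stops at the terminator, not inside the multi-byte sequence.
byte_len = c_strlen(wire)
print(byte_len)  # 6 (five characters, one of which takes two bytes)

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so none of them can be mistaken for NUL or for any ASCII character.
high_bits = all(b >= 0x80 for b in "\u00f1".encode("utf-8"))
print(high_bits)  # True

# A byte-per-character program displays the sequence as two characters.
odd = wire[:byte_len].decode("latin-1")
print(odd)  # MuÃ±oz
```

The length is reported in bytes rather than characters, and the display is wrong, but the string boundaries are found correctly.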

On Fri, 1 Aug 2008, Mark Taylor wrote:

> On Fri, 1 Aug 2008, Carlos Rodrigo Blanco wrote:
>
> > Hi
> >
> > I'm sorry that I don't know much about unicode encoding and I feel quite
> > ashamed of showing this ignorance, but I wonder what happens with latin
> > characters and so.
> >
> > If I have to write, for instance, some author name in a xml document that
> > includes some latin character (like ñ), is that allowed?
>
> Writing it in an XML document - no problem. XML, and Unicode on which
> it is based, is very capable at representing almost any character
> from almost any language you can think of (and a lot more).
>
> As far as SAMP goes: that character looks to me like code point 0xf1, from the
> Latin-1 Supplement code block. So you could not send it using either the
> existing definition for a SAMP string or the proposal (4) that I am
> suggesting. If we used a variant of my suggestion (3):
>
> 3. Define some escaping convention for un-XML characters, e.g. \u001f
> for character 31.
>
> with the intention that this escaping mechanism could be used for
> any 8-bit character it would be possible to transmit this kind of non-7-bit
> Latin character. However, characters with the 8th bit set might cause
> problems for certain other transports and language environments. I must admit
> apart from RFC-822 mail-type contexts I can't think of what these might be,
> but I'd be inclined to steer clear of non-7-bit characters just in case.
> However, if others (e.g. with less Anglo-Saxon prejudices) think that it's an
> important requirement to permit transmission of characters like this within
> SAMP we could take that on board. We could even in principle say that this
> escaping mechanism could be used to specify any Unicode character - but I
> think that would definitely be a bad idea as it would effectively restrict use
> of the protocol to languages with Unicode support, which excludes quite a lot.
>
> Mark
>
> --
> Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> m.b.taylor@bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
Received on 2008-08-01Z22:54:56