On Fri, 1 Aug 2008, Doug Tody wrote:
> Hey Mark -
>
> I agree with your sentiment that string data which we want to
> manipulate in any language or environment should be simple; if
> necessary a separate datatype could be declared for representing
> e.g. general Unicode encoded text.
>
> What about UTF-8 though? This is backwards compatible with ASCII
> but allows any Unicode character to be represented using multi-byte
> sequences - if there are no funny characters it is the same as ASCII.
> This is much like your escape sequence proposal, but is a widely used
> standard. XML has mandatory support for UTF-8 (almost any XML document
> one sees is UTF-8 encoded) so there should be no problems there.
Hi Doug,
You're right: UTF-8 does look like a better solution than the \uxxxx escaping mechanism (borrowed from Java) that I suggested, as far as transmitting things like accented letters and characters from non-Latin alphabets goes. However, it doesn't solve the problem which started this thread, since you still won't be able to include characters in the ranges excluded by the XML Char definition. Those are simply not permitted in an XML document, regardless of encoding (and in any case the UTF-8 encoding of 0x1f is the single byte 0x1f, so the encoding makes no difference for such characters).
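To make the point concrete, here is a rough Python sketch. The helper names `is_xml_char` and `parses_as_xml` are made up for illustration, and the ranges are the ones I understand the XML 1.0 Char production to allow; it only shows that 0x1f is unchanged by UTF-8 and that a parser still refuses it.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp falls in the XML 1.0 Char ranges.

    Hypothetical helper; the ranges are assumed from the XML 1.0 spec.
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses_as_xml(text: str) -> bool:
    """Return True if the standard-library parser accepts text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# UTF-8 leaves ASCII control characters as the same single byte ...
assert "\x1f".encode("utf-8") == b"\x1f"
# ... while an accented letter becomes a multi-byte sequence.
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"

# The accented letter is a legal XML Char; 0x1f is not, whatever the encoding.
assert is_xml_char(0xE9)
assert not is_xml_char(0x1F)
assert parses_as_xml("<a>\u00e9</a>")
assert not parses_as_xml("<a>\x1f</a>")
```

So UTF-8 helps with the accented-letter case but leaves the control-character case exactly where it was.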
Mark
-- 
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor@bris.ac.uk   +44-117-928-8776   http://www.star.bris.ac.uk/~mbt/

Received on 2008-08-04Z09:59:36