Re: String character range

From: Doug Tody <dtody-at-nrao.edu>
Date: Wed, 20 Aug 2008 14:28:09 -0600 (MDT)


Hi Again -

Well, this is amusing. In the process of testing some SSA code I generated a VOTable which could not be read in either Topcat or a Microsoft XML viewer I have on Windows, although it was processed fine by some other programs.

The offending text turned out to be the following:

    <PARAM ID="ContactName" datatype="char" name="ContactName" ucd="meta.bib.author;meta.curation" utype="spec:Spectrum.Curation.Contact.Name" value="László Dobos" arraysize="*">

The problem of course is that the value contains non-ASCII characters (the "á" and "ó" in "László"), encoded as UTF-8. So this obviously can cause problems with existing software, even though in this case it was legal XML.
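
To make the byte-level picture concrete, here is a minimal Python sketch. It is only an illustration: the variable and function names are hypothetical, and I have not established which encoding the failing tools actually assumed. It shows that the UTF-8 form of the name contains bytes outside the 7-bit range, that a strict ASCII decoder rejects it, and that the plain ASCII characters are encoded unchanged.

```python
# Illustrative sketch only; names here are hypothetical, not from any SAMP or VOTable library.
name = "L\u00e1szl\u00f3 Dobos"  # "László Dobos", written with escapes to avoid source-encoding issues

utf8_bytes = name.encode("utf-8")

# Bytes with the high bit set lie outside 7-bit ASCII.
non_ascii = [b for b in utf8_bytes if b >= 0x80]


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if a strict 7-bit ASCII decoder accepts the bytes."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# The ASCII characters keep their single-byte values in UTF-8;
# only the two accented letters expand to two bytes each.
ascii_only = bytes(b for b in utf8_bytes if b < 0x80)

print(len(name), len(utf8_bytes))       # characters vs. encoded bytes
print([hex(b) for b in non_ascii])      # the high-bit bytes
print(decodes_as_ascii(utf8_bytes))     # strict ASCII consumer
```

Software that passes the bytes through untouched is unaffected; software that validates or decodes them as strict ASCII, or as some other 8-bit encoding, is where the trouble starts.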

On Tue, 19 Aug 2008, Doug Tody wrote:

> Hi Mark -
>
> I also don't think this is a very important issue, but if others do,
> this would be a reasonable way to provide it without compromising
> legacy code which does not support UTF-8. I would note though that
> most text formats and text-oriented software I have seen in recent
> years specify UTF-8 rather than ASCII, so clearly it is a widely used
> standard. In actual implementations which use standard libraries it
> is probably going to be supported anyway, so while UTF-8 might not
> be required it may be wise to permit it as a feature.
>
> - Doug
>
>
> On Tue, 19 Aug 2008, Mark Taylor wrote:
>
> > Doug,
> >
> > On Mon, 4 Aug 2008, Doug Tody wrote:
> >
> > > Sure, I agree that the range of allowable chars should be restricted
> > > as you suggest. My suggestion is to specify UTF-8, restricted as
> > > has been discussed for 7-bit chars, but allowing UTF-8 encoded chars
> > > to pass through. That would seem to do it and we still have simple
> > > ASCII virtually all of the time so I don't think this will break
> > > legacy code. If at some point full-up Unicode is needed (e.g. 16-bit
> > > chars), that should be a different data type.
> >
> > I am slightly against this, since it reduces the simplicity of what's
> > going on. In practice, as you say, I think the amount of problematic
> > behaviour that defining SAMP string content as UTF-8 would cause would be very
> > small. But I've had to go to the Unicode web site and read the UTF-8 FAQs to
> > convince myself that this is the case. Sloppy programmers who don't carefully
> > read the spec and treat the byte stream as if it's ASCII will be fine >99% of
> > the time. But some burden will be imposed on careful programmers who want to
> > make sure that the UTF-8 is treated properly, especially if they are working
> > on platforms which are not Unicode-aware. If non-Latin character transmission
> > is in the category "essential" or even "nice to have" I'd say this is a price
> > worth paying. If it's just "because we can" I'd say it's not. Responses so
> > far to my question:
> >
> > On Mon, 4 Aug 2008, Mark Taylor wrote:
> >
> > > Which of these is best depends on how important the requirement to be
> > > able to send Unicode and control characters is. My vote is not very.
> > > Can we have a show of hands?
> >
> > suggest to me that this is in the "because we can" category. But if people
> > believe that non-Latin character transmission is something
> > that we really ought to have in SAMP strings, then I'd go along with
> > this suggestion.
> >
> > Mark
> >
> > --
> > Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> > m.b.taylor@bris.ac.uk +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
> >
>
Received on 2008-08-20Z22:29:18