Re: Architecture of IVOA version 0.4

From: Clive Page <cgp-at-star.le.ac.uk>
Date: Wed, 19 May 2004 12:36:26 +0100 (BST)


Roy

Sorry for a belated response, but hope comments on your V0.4 document are still being accepted.

I think the VO projects need a document explaining our plans in terms that the astronomer-in-the-street can understand. The majority of astronomers that I meet (other than those directly involved in the VO) seem to think we are simply wasting public money. I'm not sure whether this document is capable of fulfilling that need, but parts of it already come close. To do this it will need footnotes (or appendices) explaining the jargon and TLAs, such as XML.

We also need a top-down architectural description, and a reasoned explanation of why each component is needed; this will help fulfil Andy's wish for an analysis which will identify any gaps in our architecture.

I think the current draft is lacking in section 2, where it baldly states that there are three broad classes of service: registry services, data services, and compute services. The latter two will be familiar to astronomers, but the registry services will not, and need justification and explanation. Without such background, section 6 will be more or less incomprehensible to most astronomers. A fuller justification may uncover something of a lack of consensus on the centrality of the registry among VO projects, but that's no bad thing: if some VO projects are going to have a more detailed registry than others, we need to be sure that they will interwork properly.

The basic idea of a registry is easy to explain: you need at the very least a list of top-level URLs of astronomy data centres and the services they support, otherwise searches will be no better than you can do using Google. The GLU system of CDS is such a basic registry, with the addition of details of the interfaces of each service, so that e.g. cone-searches can be distributed to a number of data centres using a single set of parameters.
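To illustrate the point, here is a minimal sketch of how a basic registry lets one set of cone-search parameters be sent to several data centres. The registry entries and URLs are invented for illustration; the parameter names RA, DEC and SR (all in degrees) are assumed to follow the usual cone-search convention.

```python
from urllib.parse import urlencode

# Hypothetical registry: each entry is just a name and the base URL of a
# cone-search service. None of these sites exist.
REGISTRY = [
    {"name": "Data centre A", "cone_search_url": "http://a.example.org/cone"},
    {"name": "Data centre B", "cone_search_url": "http://b.example.org/scs?catalog=demo"},
]

def cone_search_urls(registry, ra, dec, radius):
    """Build one query URL per registered service from a single set of
    parameters (position and search radius, all in degrees)."""
    params = urlencode({"RA": ra, "DEC": dec, "SR": radius})
    urls = []
    for entry in registry:
        base = entry["cone_search_url"]
        # Append to an existing query string if the base URL already has one.
        separator = "&" if "?" in base else "?"
        urls.append(base + separator + params)
    return urls

urls = cone_search_urls(REGISTRY, 123.4, 45.6, 2)
```

The same three numbers go to every service; only the base URL, which comes from the registry, differs.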

The addition of more metadata to the registry can be justified on the grounds of making searches more directed: say you want to find all observations of position (ra,dec) in waveband X then having at least crude sky coverage and waveband details attached to each registry entry will make such queries a lot faster, and avoid queries to inappropriate sites.
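A rough sketch of what such a directed search might look like, assuming each registry entry carries a crude circular sky-coverage region and a set of waveband labels. The entry structure, field names and example services are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class RegistryEntry:
    name: str
    wavebands: frozenset   # e.g. {"optical"}; labels are illustrative
    cov_ra: float          # centre of crude coverage cap, degrees
    cov_dec: float
    cov_radius: float      # radius of coverage cap, degrees

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def candidate_services(registry, ra, dec, waveband):
    """Keep only services whose waveband and crude coverage match the query."""
    return [
        e for e in registry
        if waveband in e.wavebands
        and angular_separation(ra, dec, e.cov_ra, e.cov_dec) <= e.cov_radius
    ]

# Invented example entries.
REGISTRY = [
    RegistryEntry("All-sky X-ray survey", frozenset({"x-ray"}), 0.0, 0.0, 180.0),
    RegistryEntry("Northern optical survey", frozenset({"optical"}), 0.0, 90.0, 90.0),
    RegistryEntry("Small southern radio field", frozenset({"radio"}), 50.0, -60.0, 5.0),
]

optical_hits = [e.name for e in candidate_services(REGISTRY, 123.4, 45.6, "optical")]
xray_hits = [e.name for e in candidate_services(REGISTRY, 123.4, 45.6, "x-ray")]
```

Only the services that survive this filter need be queried at all, which is where the saving comes from.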

Further justification, I think, comes from the recognition that many data services are fronted by DBMS, but queries to them will be almost impossible to formulate without knowing the name, data types, units, UCDs and maybe other details of each column in each table. Current relational DBMS have no provision for such metadata beyond the bare column name. To make the problems concrete, consider an ADQL/S query like:

 SELECT * FROM table WHERE REGION('CIRCLE ICRS 123.4, 45.6, 2') AND properMotionRA > 100;

The REGION bit is feasible because we have *defined* that its arguments are always in degrees, but the proper-motion bit will work only if the user knows the units or we require them to be the same for every data service in the VO. I think the latter is impractical (Vizier catalogues alone use 19 different units for proper motion, none of them the SI unit for angular velocity). We could require the use of "standard" units in every query with local translations, but I suspect that getting agreement on a set of standard units will be very hard.
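To make the "local translation" option concrete, here is a small sketch that converts a proper-motion threshold from one unit to another before it is applied to a local column. The unit labels and the choice of mas/yr as the pivot unit are assumptions made for the example, not an agreed standard, and only three of the many units in use are covered.

```python
# Conversion factors from each unit to milliarcseconds per year.
# The unit strings are illustrative labels, not a proposed vocabulary.
TO_MAS_PER_YEAR = {
    "mas/yr": 1.0,
    "arcsec/yr": 1000.0,       # 1 arcsec = 1000 mas
    "arcsec/century": 10.0,    # 1000 mas spread over 100 years
}

def translate_threshold(value, from_unit, to_unit):
    """Convert a proper-motion value between two known units."""
    try:
        in_mas = value * TO_MAS_PER_YEAR[from_unit]
        return in_mas / TO_MAS_PER_YEAR[to_unit]
    except KeyError as exc:
        raise ValueError(f"unknown proper-motion unit: {exc.args[0]}") from None

# A user asks for properMotionRA > 100 mas/yr; the local table stores
# the column in arcsec/century (hypothetical case).
local_threshold = translate_threshold(100, "mas/yr", "arcsec/century")
```

The translation itself is trivial; the hard part, as noted above, is knowing the unit of the local column in the first place.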

Alternatively we could require that every table in a VO-compliant DBMS supports a metadata query (such as: give me the units and UCDs of all your columns). If I understand it right, most people seem to think that such queries are best performed on the Registry. Data centre managers who will have to populate these metadata databases will want this under their control, which is likely to lead to a registry associated with each data centre. The local registry will have full details of local datasets, but holding full details of all VO datasets in the world might make it too large. The solution seems to be a distributed information system, much like the DNS, with regular harvesting. This structure can perhaps be explained to astronomers in these terms.
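As a sketch of what a column-metadata query might return, whether answered by the DBMS or by a local registry, consider the following. The table name, column names, units and UCD strings are purely illustrative, and the two functions are hypothetical rather than part of any existing interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnMeta:
    name: str
    datatype: str
    unit: str
    ucd: str

# Hypothetical metadata held for one local table.
TABLE_METADATA = {
    "demo_catalogue": [
        ColumnMeta("ra", "double", "deg", "POS_EQ_RA_MAIN"),
        ColumnMeta("dec", "double", "deg", "POS_EQ_DEC_MAIN"),
        ColumnMeta("properMotionRA", "float", "arcsec/century", "POS_EQ_PMRA"),
    ],
}

def describe_columns(table):
    """Answer 'give me the units and UCDs of all your columns'."""
    return list(TABLE_METADATA[table])

def columns_with_ucd(table, ucd):
    """Find the columns of a table that carry a given UCD."""
    return [c for c in describe_columns(table) if c.ucd == ucd]

pm_columns = columns_with_ucd("demo_catalogue", "POS_EQ_PMRA")
```

With an answer like this in hand, a client can discover that the proper-motion column is called properMotionRA and is stored in arcsec/century before it formulates the query.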

Section 3: web and grid services. It would be sensible to explain that HTTP GET/POST services can only be queried using specially tailored form interfaces (or where something like GLU is available to transform query parameters to the form they need). I think that WSDL can be used to describe such interfaces - does this solve the problem of making current HTTP/CGI based services available as VO resources? If so it would be nice to include a short explanation of this.

I agree that the main advantage of SOAP services is that WSDL can be used to describe them; does WSDL do everything we want, or is more needed? An outline of UDDI would be worthwhile, explaining why the VO world has rejected it.

VOTable format: it would be good to include a description of why the VO needs a new data format. Billions of FITS files exist, so astronomers will want to know why we are inventing a new format which is generally ten times as verbose and which hardly any current software will handle. I think we ought not to ignore the current debates over the structure of XML files holding tabular data: VOTable is something of a compromise over what is convenient for astronomical use, and what is easily parsed by existing XML tools.

OpenSkyQuery and ADQL: It concerns me a bit that we are developing or at least considering a number of separate query languages: ADQL for tabular data, with another for image/spectral data, and yet another for registry queries. It would be nice for the architecture document to pull these together, and identify the somewhat conflicting requirements. Then we might be able to work out whether it is feasible to have just one single VO query language, or not.

Regards

-- 
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH,  U.K.
Received on 2004-05-19Z11:36:46