Hi folks,
Some comments on these two drafts: one general comment and a number of more specific ones. Apologies if I've forgotten earlier discussions that concluded with agreement on the things with which I disagree below.
My general comment is just a slight unease that both of these documents are so reminiscent of the detailed design choices behind the SkyQuery.Net prototype. Now, I'm a great fan of SkyQuery.Net, and nobody has come up with anything better, to my knowledge, but I'm worried that in wishing to be pragmatic and get something working, we're enshrining in our standards implementation details from one prototype system without proper consideration of alternatives or of generalities.
ADQL v1.01
----------
1. XMATCH. I'm worried about giving one particular cross-matching algorithm such a prominence in the query language. I've bored many of you with this at length before, but the truth is that the general catalogue cross-matching problem is a tricky one, with different algorithms being appropriate in different situations. The XMATCH algorithm developed as part of SkyQuery.Net is a clever solution given certain simplifying assumptions, and it has a number of nice features, but it is only one possible cross-matching algorithm.
Now, ADQL v1.01 is an advance on some earlier versions in that its XMATCH function explicitly allows for more than one mode (via the fourth parameter), but it leaves the third parameter being something that may not be well defined in modes other than the SkyQuery.Net XMATCH algorithm. Would it not be better to have the mode given by the third parameter and to allow for further parameters which would have meaning only for specific modes?
To give you an example of what I have in mind, consider the following. Consider two databases A and B, which contain data with very different angular resolutions - maybe A is an optical sky survey, and B comes from a single-dish radio survey - such that within the error ellipse of each source in B may lie a number of objects from A. In that case, cross-matching by spatial proximity alone is inadequate, and the simple application of the SkyQuery.Net XMATCH algorithm will yield some bad associations. To get good associations in this situation you need something more - maybe prior astrophysical knowledge of the properties of the populations of objects found in A and B, or the application of some machine learning technique which will figure out useful correlations between the attributes describing the sources in the two catalogues which can be used in making associations, or a cached list of associations between them which were made earlier and can be re-used now. These might each be different modes of XMATCH, but each will require a different parameter or set of parameters for the SkyNodes to run them.
Of course, most of the cross-matching algorithms one can think of will have a spatial component to them, even if, unlike the SkyQuery.Net algorithm, they don't work purely by spatial proximity. So, maybe the correct basic operation is finding all neighbours in one database of each entry in another, given some maximum matching radius. Most other cross-matching algorithms can start with such a restricted list of plausible candidate matches, but this NEIGHBOURS operation lacks some of the nice features of the SkyQuery.Net XMATCH, such as being symmetric in the first two arguments.
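To make the NEIGHBOURS idea concrete, here is a rough Python sketch. It is only an illustration under my own assumptions: the function names, the (id, ra, dec) tuple layout and the sample coordinates are all invented, and a real SkyNode would use a spatial index rather than this brute-force loop.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def neighbours(primary, secondary, max_radius_deg):
    """For each primary source, list ids of all secondary sources within the radius.

    Brute force, O(N*M): fine for a sketch, not for real catalogues.
    """
    result = {}
    for pid, pra, pdec in primary:
        result[pid] = [
            sid for sid, sra, sdec in secondary
            if angular_separation_deg(pra, pdec, sra, sdec) <= max_radius_deg
        ]
    return result

# Hypothetical data: one low-resolution radio source, three optical objects.
radio = [("B1", 10.0, 20.0)]
optical = [("A1", 10.001, 20.0), ("A2", 10.0, 20.002), ("A3", 11.0, 20.0)]

candidates = neighbours(radio, optical, max_radius_deg=0.01)
reverse = neighbours(optical, radio, max_radius_deg=0.01)
```

Here B1 picks up two candidate counterparts (A1 and A2), which is exactly the ambiguous case described above. The result is keyed by the first argument, so swapping the arguments gives a differently shaped answer; that is the lack of symmetry I mean.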
Anyway, maybe the pragmatic decision at this point is to accept the SkyQuery.Net XMATCH as the default cross-matching algorithm, as it will work in many situations, but it is not the solution to the general problem, and it worries me that it is being implicitly enshrined as such in ADQL, with the result that unwary users could easily make bad matches. At the very least, I think the order of the third and subsequent parameters should be rearranged.
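As a rough illustration of the reordering I have in mind (mode third, mode-specific parameters afterwards), here is a Python sketch. None of this is in the draft: the mode names, the parameter names and the xmatch() function itself are hypothetical, and the function only validates a request rather than running any matching.

```python
# Hypothetical registry: mode name -> names of the extra parameters it requires.
XMATCH_MODES = {
    "proximity": {"max_radius_arcsec"},   # purely spatial matching
    "cached": {"association_table"},      # re-use associations made earlier
}

def xmatch(table_a, table_b, mode, **mode_params):
    """Build a cross-match request: mode comes third, its parameters follow."""
    if mode not in XMATCH_MODES:
        raise ValueError(f"unknown cross-match mode: {mode!r}")
    required = XMATCH_MODES[mode]
    supplied = set(mode_params)
    if supplied != required:
        raise ValueError(
            f"mode {mode!r} expects parameters {sorted(required)}, "
            f"got {sorted(supplied)}"
        )
    return {"tables": (table_a, table_b), "mode": mode, "params": mode_params}

request = xmatch("A", "B", "proximity", max_radius_arcsec=3.5)
```

The point of the arrangement is that a parameter which only makes sense for one algorithm never has to be given a meaning in the others; a node that does not support a mode can reject it up front.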
2. TOP. I don't think that the current explanation of TOP - that it returns "the first N records satisfying the criteria" - is very meaningful. Surely its inclusion in a query ensures only that no more than N records are returned, and, since there is no ordering of the records in the full result set assumed, the notion of "first N" has no meaning.
3. Units. It's not clear to me how the inclusion of units in a query will work, and whether they should be there at all. The "Columns" interface defined in the SkyNode Interface spec will return the units and even Basic SkyNodes are supposed to implement that, so why do we also have units in the query?
Also, how would the units part of the query be used? If I have a database which stores fluxes in mJy and I get a query with a constraint expressed in terms of fluxes in Jy, am I expected to convert the numerical value or do I throw an exception? If the former, what list of possible units am I expected to understand?
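The two options look something like this in a Python sketch. The unit table, the function name and the exception class are my own inventions for illustration; the draft defines none of them, which is really the point of the question.

```python
# Hypothetical list of flux units this node understands, as factors to Jy.
TO_JY = {"Jy": 1.0, "mJy": 1e-3, "uJy": 1e-6}

class UnsupportedUnitError(ValueError):
    """Raised when a query uses a unit that this node cannot convert."""

def convert_flux(value, from_unit, to_unit):
    """Convert a flux constraint from the query's unit to the database's unit."""
    try:
        return value * TO_JY[from_unit] / TO_JY[to_unit]
    except KeyError as exc:
        raise UnsupportedUnitError(f"cannot convert unit {exc.args[0]!r}") from exc

# Query says "flux > 0.5 Jy"; the database column is stored in mJy.
threshold_mjy = convert_flux(0.5, "Jy", "mJy")
```

Either behaviour is implementable, but the first one only works if the spec pins down which unit strings every node must accept.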
4. ADQL Grammar. This is really just a question, but is the current Appendix - "the ANTLR grammar used to produce the [ADQL] parser in C#" - really the best way to express the grammar of ADQL?
SkyNode Interface v1.01
-----------------------
1. Rank. Section 4.2.2 discusses the necessity for an ordering mechanism, which would indicate to the user which are the most important columns, make it possible (e.g.) to get RA and Dec to appear near each other in a column list and identify which columns are of interest to "radio astronomers, cosmologists, optical astronomers", etc. I don't see that any of these are pressing needs.
2. QI-12. I think the QueryCost() interface should return the number of rows satisfying a given set of criteria and not the surface density of such objects. Consider the case of database C, which is a shallow, all-sky survey, and database D, which is a catalogue generated from a single, deep HST image. D may have a much higher surface density of objects with, say, g-r > 1, but C is likely to have a far larger number of rows satisfying that criterion, and it's clearly the number of rows that is needed for the query planning activity envisaged here.
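A quick worked example of the C versus D case, as a Python sketch. The densities and the area of the deep field are made-up round numbers chosen only to show the effect; the full-sky area is just 4 pi steradians expressed in square degrees.

```python
import math

# Whole sky in square degrees: 4*pi steradians, about 41253 sq deg.
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2

def expected_rows(density_per_sq_deg, area_sq_deg):
    """Rows matching a criterion = surface density times area covered."""
    return density_per_sq_deg * area_sq_deg

# Hypothetical numbers for objects with g-r > 1.
density_c = 100.0       # shallow all-sky survey C, objects per sq deg
density_d = 100_000.0   # deep single-image catalogue D, objects per sq deg
area_d = 0.003          # assumed area of the single deep image, sq deg

rows_c = expected_rows(density_c, FULL_SKY_SQ_DEG)
rows_d = expected_rows(density_d, area_d)
```

With these numbers D has a surface density a thousand times higher than C, yet C returns millions of rows against a few hundred for D. A planner ordering the nodes by density would get the order wrong.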
3. QI-15. This is another place where I'm worried that we may be defining a standard based on the detailed design choices of the SkyQuery.Net prototype. In SkyQuery.Net the query is passed down the chain of databases and the developing result set is passed up it, and that is what is envisaged for the ExecutePlan() method here - albeit with the added refinement that SkyNodes only receive the relevant portion of the plan, not the full plan. But it may be preferable to have the partial result sets joined in some compute node, rather than in one of the SkyNodes, or to deposit partial result sets in VOSpace and pass around handles to files there, rather than passing around the full datasets. I don't know what will be best, but I'm worried that we're defining language constructs on the basis of one prototype implementation.
cheers
Bob

Received on 2005-07-12Z20:08:07