Re: Spectrum data model

From: Doug Tody <dtody-at-nrao.edu>
Date: Wed, 13 Sep 2006 23:41:27 -0600


Hi All -

After reading Anita's careful review of Spectrum (thanks Anita!) and Jonathan's thoughtful replies, I think the issues below are the most important, so some further elaboration follows.

Required/optional vs must/should/may

    The advantage of must/should/may is that it allows us to differentiate
    between "minimal compliance" (all the "must"s) and "full compliance"
    (all the "should"s). This is useful as we want minimal compliance
    to be as low a bar as is reasonable, but we would really prefer that
    most services implement at least the "should"s. To reward service
    implementors for doing more we would do something like flag fully
    compliant services in the registry. Hence I tend to agree that it is
    useful to make the must/should/may distinction.
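
    As a rough illustration of the minimal/full distinction, the sketch
    below classifies a service by which fields it provides. The field
    list and the level assigned to each field are hypothetical, chosen
    only for the example, and are not taken from the Spectrum spec.

```python
from enum import Enum


class Level(Enum):
    MUST = "must"
    SHOULD = "should"
    MAY = "may"


# Hypothetical requirement levels, for illustration only.
REQUIREMENTS = {
    "Coverage.Location": Level.MUST,
    "Title": Level.MUST,
    "Target.Name": Level.SHOULD,
    "Curation.PublisherDID": Level.SHOULD,
    "CreationType": Level.MAY,
}


def compliance(provided):
    """Classify a set of provided field names.

    Returns "non-compliant" if any "must" is missing, "full" if every
    "must" and every "should" is present, and "minimal" otherwise.
    """
    provided = set(provided)
    musts = {f for f, lvl in REQUIREMENTS.items() if lvl is Level.MUST}
    shoulds = {f for f, lvl in REQUIREMENTS.items() if lvl is Level.SHOULD}
    if not musts <= provided:
        return "non-compliant"
    if shoulds <= provided:
        return "full"
    return "minimal"
```

    A registry could then flag only those services for which
    compliance(...) returns "full".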

    In general what is required or optional depends upon how a general
    data model is used - it might be different in different circumstances.
    For Spectrum the priorities are probably pretty clear, but for
    something more general like Char it will really depend upon the
    application (hence it is not clear how much this should be specified
    at the level of the Char spec).

Coordinate systems other than just ra/dec

    For the 2nd generation DAL interfaces it is probably too restrictive
    to limit ourselves to only ICRS/J2000, as for SIA. For example, we
    already have folks trying to use DAL for solar data. A reasonable
    compromise is to default coordinates to ICRS as in SIA, but provide a
    means to optionally specify a different coordinate system; whether or
    not other coordinate systems are supported would be a service-specific
    capability.

    The above refers mainly to the query interface and standard
    parameters. To describe the actual data we probably want to
    permit the native coordinate systems of the data to be used.
    This is already done in SIA 1.0, where the WCS information allows
    the coordinate system to be specified rather than requiring that a
    new WCS be computed to publish the data.
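
    For the query side, the "default to ICRS, optionally override"
    compromise might look something like the sketch below. The
    parameter name FRAME, the frame names, and the function itself are
    assumptions made for illustration; no such parameter is defined in
    this message.

```python
DEFAULT_FRAME = "ICRS"


def resolve_frame(params, supported=("ICRS",)):
    """Pick the coordinate system for a query.

    `params` is a dict of query parameters. "FRAME" is a hypothetical
    parameter name. `supported` lists the frames this particular
    service advertises as a capability.
    """
    # Fall back to ICRS when the client does not ask for anything else.
    frame = params.get("FRAME", DEFAULT_FRAME)
    if frame not in supported:
        raise ValueError(
            f"coordinate system {frame!r} is not supported by this service"
        )
    return frame
```

    A service holding solar data could pass a longer `supported` tuple,
    while a service that only handles ICRS would reject any other
    request.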

Should Coverage.Location (or whatever) be a MUST

    I agree with Jonathan that fundamental metadata such as this is a
    "must". Anita is correct that it may not be appropriate for all
    data, e.g., theory data, but we should at least require it where it
    is appropriate for the data. Rather than define what "appropriate"
    means it might be better to define values such as "not applicable"
    or "undefined", and still require such a value to be specified even
    for data where the value is not applicable. This would allow more
    rigorous queries to be performed. The problem is, this may not be
    possible for numeric values other than in a text-based serialization.
    (I saw something like this elsewhere recently, possibly in VOEvent).
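
    A minimal sketch of the idea for a text-based serialization, where
    a numeric field can carry either a number or one of the two special
    values. The enum and function names are hypothetical.

```python
from enum import Enum
from typing import Union


class Missing(Enum):
    NOT_APPLICABLE = "not applicable"
    UNDEFINED = "undefined"


def parse_value(text: str) -> Union[float, Missing]:
    """Parse a serialized numeric field that may hold a special value."""
    stripped = text.strip().lower()
    for marker in Missing:
        if stripped == marker.value:
            return marker
    # Anything else must be a number; float() raises ValueError if not.
    return float(stripped)


def is_specified(record: dict, field: str) -> bool:
    """A "must" is satisfied by a real value or an explicit special value."""
    return field in record
```

    With this, a query can distinguish a record that explicitly says
    "not applicable" from one that simply omits the field.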

Mediation to a standard data model vs pass-through of native data

    This is an essential feature of SSA. There is no standard
    astronomical format for spectra, and at the scale of the VO, where
    a client application may access spectra from dozens of archives,
    it becomes impractical for each client application to know how to
    deal with spectral data from dozens of different projects (sure,
    a few applications do this now for a few archives, but that is not
    good enough, and such a scheme will break whenever anything changes).

    What we want to make possible is for each SSA service to return data
    conforming to the SSA data models (Spectrum in this case), so that
    the mediation occurs once in the service rather than hundreds of
    times in remote applications. A pass-through for "native" format
    data is also important, in part for on-the-cheap services that can't
    perform the data conversion, or more importantly, to obtain direct
    access to the native data so that clients with intimate knowledge
    of a specific data collection can take advantage of project-specific
    features of the data. Both approaches are important.

Target.Name vs dataset IDs, collection, etc.

    Target.Name is just the name of the observed object (if any), such
    as one might pass to a name resolver. (Title is the more important
    version of this since it always applies and is broader).

    Collection is the data collection (ShortName) e.g., "SDSS-DR4"
    or whatever. DataID.CreatorDID is the dataset ID (URI) assigned
    to the dataset (spectrum) by its creator, e.g., the survey project
    or observatory which created the data collection. The CreatorDID
    does not change if the data is replicated. Curation.PublisherDID is
    the dataset ID assigned by the publisher, and will be different for
    each publisher.
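
    The relationship between these identifiers can be sketched as a
    small record type. The class, the function, and the example URIs
    are hypothetical and serve only to show that replication changes
    the PublisherDID while the CreatorDID stays fixed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatasetIdentity:
    collection: str      # data collection ShortName, e.g. "SDSS-DR4"
    creator_did: str     # DataID.CreatorDID, assigned once by the creator
    publisher_did: str   # Curation.PublisherDID, assigned by each publisher


def republish(dataset: DatasetIdentity, new_publisher_did: str) -> DatasetIdentity:
    """Model replication: only the publisher's dataset ID changes."""
    return replace(dataset, publisher_did=new_publisher_did)


# Made-up URIs, for illustration only.
original = DatasetIdentity(
    collection="SDSS-DR4",
    creator_did="ivo://example.creator/spectra#1234",
    publisher_did="ivo://example.publisher-a/spectra#1234",
)
mirror = republish(original, "ivo://example.publisher-b/copies#9876")
```

    Here `mirror.creator_did` equals `original.creator_did`, so a client
    can recognize the two published copies as the same dataset.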

    It is possible that the published dataset returned by the service may
    differ significantly from the "parent" (Creator's) dataset, e.g., in
    the case of virtual or derived data. This can be indicated with the
    CreationType attribute. For example, if we extract a spectrum from a
    data cube, CreatorDID identifies the cube, PublisherDID the extracted
    spectrum, and CreationType is something like "extracted spectrum".
    This is a primitive form of provenance model. If a completely new
    collection is formed by analysis then a new Creator resource is
    required to describe it.

Received on 2006-09-14Z07:42:02