Doug Tody wrote:
> I suggest reviewing the protocol document and commenting on what is
> required for a minimally or fully compliant service for the specific
> use-case of data discovery and selection.
Pavlos Protopapas wrote:
> Now about the issue of ID that I raised.
> A simple scenario. Let's say somehow
> I get two spectra. I do not know how and why. Maybe
> my program generates them, so SSA was not involved
> in this. Now I want to make sure that I do not have duplicates.
> I do need an ID, don't I?
A separate ID field is certainly helpful and perhaps even "required" (for instance, to guarantee uniqueness of sample selection for some statistical study), but it may not be strictly necessary. Combining a small selection of other metadata may yield an identifier just as unique as one supplied by an archive or service. For raw data, telescope+timestamp is often sufficient. NOAO has been adding such an OBSID keyword to our headers for many years. (Nobody will dispute the value of having the disambiguation string precomputed.) For telescopes on which multiple instruments may be used in a single observing session, we add an instrument ID to the mix. For instruments that take multiple exposures or that can take rapid sequential exposures, we add an instrument-supplied running number.
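As a rough illustration of the composite identifier described above, the sketch below joins telescope, timestamp, and the optional instrument ID and running number into one string. The function name `make_obs_id`, the field separator, the timestamp layout, and the sample values are all assumptions made for the example; this is not the actual format of the NOAO OBSID keyword.

```python
from datetime import datetime, timezone
from typing import Optional


def make_obs_id(telescope: str, start: datetime,
                instrument: Optional[str] = None,
                sequence: Optional[int] = None) -> str:
    """Join disambiguating metadata into one identifier (hypothetical layout)."""
    # Normalise to UTC so the same exposure always yields the same string.
    stamp = start.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S")
    parts = [telescope.lower(), stamp]
    # Add the instrument only where several share one observing session.
    if instrument is not None:
        parts.append(instrument.lower())
    # Add the running number only where exposures can share a timestamp.
    if sequence is not None:
        parts.append(f"{sequence:04d}")
    return ".".join(parts)


# Sample values are invented for illustration.
when = datetime(2006, 10, 28, 15, 22, 18, tzinfo=timezone.utc)
print(make_obs_id("kp4m", when))
print(make_obs_id("kp4m", when, instrument="mosaic", sequence=7))
```

Each optional field is appended only when supplied, mirroring the way the extra components are added only for telescopes and instruments that need them.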
VO in general and spectra in particular are typically not focused on raw data, of course, and multiple data products can result from a single raw input. In that case one might consider disambiguating by adding a processing code. Then you run into the versioning problem - perhaps a pipeline was run twice with different calibrations. So add a versioning code. There is always some way to disambiguate.
The point I'm trying to reach is that an ID is no guarantee of uniqueness unless the entire chain of data handling and processing is always controlled - and in that case other metadata may serve equally well.
The only true ID is supplied by each dataset itself, for instance as a checksum, hash function, message digest or digital signature of the pixels (however represented for a spectrum). I've often used IRAF imstat to report skew and kurtosis as well as the more typical low order statistical moments when I need true confirmation that an image I'm handling in one context is the same as another image presented to me in a different context.
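A minimal sketch of both ideas, assuming the spectrum or image has already been read into a flat sequence of floating-point values: a digest computed over the values themselves, and the low-order moments plus skew and kurtosis. The function names are hypothetical, the choice of SHA-256 and big-endian 64-bit packing is arbitrary, and the moments use population definitions that may differ from the conventions IRAF imstat applies.

```python
import hashlib
import struct
from typing import Sequence, Tuple


def pixel_digest(pixels: Sequence[float]) -> str:
    """SHA-256 over the values packed as big-endian 64-bit floats."""
    # A fixed byte order makes the digest independent of the host machine.
    packed = struct.pack(f">{len(pixels)}d", *pixels)
    return hashlib.sha256(packed).hexdigest()


def moments(pixels: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (mean, standard deviation, skew, excess kurtosis)."""
    n = len(pixels)
    mean = sum(pixels) / n
    m2 = sum((x - mean) ** 2 for x in pixels) / n
    if m2 == 0.0:
        # Constant data: the higher moments are undefined, so report zeros.
        return mean, 0.0, 0.0, 0.0
    m3 = sum((x - mean) ** 3 for x in pixels) / n
    m4 = sum((x - mean) ** 4 for x in pixels) / n
    return mean, m2 ** 0.5, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = list(a)
print(pixel_digest(a) == pixel_digest(b))  # same values, same order
print(moments(a))
```

The digest changes if any value or its position changes, whereas the moments are only a strong hint: two different datasets can share the same statistics.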
One could imagine protecting the metadata using similar techniques, for instance, by "blinking" one FITS header against another (overlay two xterm windows and toggle each in turn). But unlike the data values themselves, metadata may not preserve ordering, header keywords may be rearranged, etc. Semantics implies keyword selection, but then you are just back to the original discussion above.
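One way to make such a metadata comparison insensitive to keyword order is to digest the keyword/value pairs in sorted order, optionally restricted to a chosen subset of keywords. The sketch below assumes the header has already been parsed into a plain mapping; `header_digest` and the sample keywords and values are hypothetical, and the comparison is only as good as the keyword selection, which is the semantic problem noted above.

```python
import hashlib
from typing import Iterable, Mapping, Optional


def header_digest(header: Mapping[str, object],
                  keywords: Optional[Iterable[str]] = None) -> str:
    """Order-independent SHA-256 over selected keyword/value pairs."""
    # Sorting removes any dependence on the order the cards were written in.
    selected = sorted(header) if keywords is None else sorted(keywords)
    digest = hashlib.sha256()
    for key in selected:
        # repr() keeps the string '1' distinct from the integer 1; a missing
        # keyword is recorded as None.
        digest.update(f"{key}={header.get(key)!r}\n".encode("utf-8"))
    return digest.hexdigest()


# Sample headers are invented for illustration.
first = {"TELESCOP": "kp4m", "EXPTIME": 300.0, "OBSERVER": "A"}
second = {"OBSERVER": "B", "EXPTIME": 300.0, "TELESCOP": "kp4m"}

print(header_digest(first) == header_digest(second))
print(header_digest(first, ["TELESCOP", "EXPTIME"])
      == header_digest(second, ["TELESCOP", "EXPTIME"]))
```

Note that values are compared by their Python representation here, so an exposure time stored as `300` in one header and `300.0` in another would count as different.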
But of course the NOAO Science Archive, and the "Save the bits" data-store before it, adds a serial number to each ingested data product. In some real sense, however, each file's MD5 or each HDU's FITS checksum is the only real identifier once a dataset escapes into the wild. An archive's (or VO service's) internal identifiers are only rigorously reliable for data kept close to home. Data security and data identification are two aspects of the same issue.
Rob Seaman
Received on 2006-10-28Z15:22:18