Data set metadata schemas

From: Anita Richards <amsr-at-jb.man.ac.uk>
Date: Tue, 17 Jun 2003 20:05:43 +0100 (BST)

I have learnt from the debate on the Registry Schema, but I can't add much except to say that I think the differences will only be resolved in use. We should start applying the schema we have to real data sets and science queries (even if these have to be carefully selected at first), and that seems to mean starting from http://www.ivoa.net/internal/IVOA/IvoaResReg/ResourceServiceMetadataV7.pdf (RSMV7). To that extent I agree with Bob, and maybe that is not controversial, as I think I am only talking about what Tony says it is OK for, i.e. "Sky coverage services", if that is taken to include spatial, spectral and temporal coverage and other metadata tied to data sets. That is, what information does the Registry need to hold in order to select potentially useful data sets, and then use their metadata to select the appropriate services to access the data for processing? (I am not commenting on the parts of the registry which describe appropriate services.)

I would like to understand better how feasible it is to link sections; for example, an entry in COMMUNITY may be the same as one for Contributor in CURATION.

We also need to think a bit more about how to acquire the dataset metadata. At least at first, we want to make sure this is done in an exemplary fashion because we will be judged by the results, so it is no good using difficulty in getting information as an excuse. In my experience with the four data sets so far, the relevant information is not held in one place; it requires human searching of web sites and human discrimination, for example to decide what is the region of regard for a catalogue: PSF (but what about systematic errors)? Pixel size (but this is arbitrary in radio images)? Largest error given (may be spurious/huge)? Eventually we will have algorithms to help decide, but these will evolve through experience, not by trying to imagine all possible circumstances. For data sets which are actively curated, we can ask someone to fill in a questionnaire; however, again, we will only discover what is open to misinterpretation after a few rounds with archivists. More seriously, we do not yet have the kudos to get people to fill them in unless they are already VO enthusiasts, and even then they often just point you at web sites with (usually) far too much detail.

However, I suggest that we start by designing a plain text form which gives examples/selections where appropriate. This could be interpreted and written to XML using a Perl script, which would also catch the commonest ambiguities (metre/meter) and handle unit conversions. We can progress to a web form as long as it is really platform-independent and avoids problems with over-long selection lists, instability if completion is interrupted etc. We are going to have to solve this problem anyway, of course, for user input! The protocols for submitting data sets to CDS are one precedent; I would welcome comments from people involved with that.
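As a rough illustration of the plain text form idea, the sketch below reads "key: value" lines and writes simple XML, normalising the metre/meter spelling and converting lengths to metres on the way. It is written in Python rather than Perl purely for illustration. The field names (title, wavelength), the "resource" root element and the alias tables are assumptions made for this example and are not taken from RSMV7 or the AstroGrid schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical spelling variants, mapped to one canonical unit name.
UNIT_ALIASES = {"meter": "metre", "meters": "metre", "metres": "metre", "m": "metre",
                "centimeter": "cm", "centimetre": "cm", "millimeter": "mm", "millimetre": "mm"}

# Conversion factors from each canonical unit to metres.
TO_METRES = {"metre": 1.0, "cm": 0.01, "mm": 0.001}


def parse_form(text):
    """Parse 'key: value' lines, skipping blank lines and '#' comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"cannot interpret line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def normalise_length(value):
    """Convert a string such as '21 cm' or '2 meter' to a float in metres."""
    parts = value.split()
    if len(parts) != 2:
        raise ValueError(f"expected '<number> <unit>', got {value!r}")
    number, unit = parts
    unit = UNIT_ALIASES.get(unit.lower(), unit.lower())
    if unit not in TO_METRES:
        raise ValueError(f"unknown length unit: {unit!r}")
    return float(number) * TO_METRES[unit]


def form_to_xml(text):
    """Write the parsed form as XML; lengths are stored in metres."""
    root = ET.Element("resource")  # hypothetical root element name
    for key, value in parse_form(text).items():
        child = ET.SubElement(root, key)
        if key == "wavelength":  # hypothetical field carrying a length
            child.set("unit", "metre")
            child.text = repr(normalise_length(value))
        else:
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

For instance, form_to_xml("title: USNO-B\nwavelength: 21 cm") would give a "resource" element holding the title as typed and the wavelength in metres. Rejecting lines that cannot be interpreted, rather than guessing, is one way to find out early what archivists misread.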

The AstroGrid Registry work-group have created a set of Resource Registry schemas for AstroGrid. These are based on RSMV7 with a few additions suggested by trying to use them to describe four real datasets. I apologise for the baby XML, I am trying to learn - all mistakes are my responsibility alone. I also apologise for possibly reinventing (but less adequately) the schemas linked to http://www.ivoa.net/internal/IVOA/IVOARegWp03/MDinXML-Summary.html - however I think I am covering a small part of this in more detail. I also note that those schemas are based on RSMV6, which explains some of the differences in organisation.

 You can find my schemas for AstroGrid at http://wiki.astrogrid.org/bin/view/Astrogrid/RegistryIt02Schema - see a little way down the page:



"Iteration 2 resource registry schema

...
resourceRegistry.xsd and the component schemas for describing an astronomical/solar/STP resource: identity.xsd, curation.xsd, content.xsd, service.xsd."
...

"Examples of the identity, curation and content xml files (in a single file) have been prepared for the 1XMM (X-ray satellite), SURF (Solar), USNO-B (reference stars), WFCSUR (Isaac Newton Telescope survey) archives."

and
http://wiki.astrogrid.org/bin/view/Astrogrid/RegistryUnits which explains where/why I have added to RSMV7. In summary, the differences are:

CURATION

  1. I have added some elements to describe the size of data sets - in Mb, and for tabular data nRows/nCols, or nPixels for 'image' data (extensible to any number of dimensions). This is to aid optimising the order of query execution and in case servers have limits on the size of data which can be returned/need to invoke a cutout server for images etc.
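To show how such size elements might be used, here is a minimal Python sketch that orders data sets smallest first and flags requests that exceed a server limit. The class and function names, and the idea of ordering purely by size, are assumptions for illustration only; they are not part of RSMV7 or the AstroGrid schemas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSetSize:
    """Hypothetical holder for the size metadata of one data set."""
    name: str
    size_mb: float
    n_rows: Optional[int] = None  # tabular data only
    n_cols: Optional[int] = None  # tabular data only


def order_for_query(datasets):
    """Return the data sets smallest first, so the cheapest is queried first."""
    return sorted(datasets, key=lambda d: d.size_mb)


def needs_cutout(size_mb, server_limit_mb):
    """True if the request is larger than the server will return in one go."""
    return size_mb > server_limit_mb
```

A real planner would presumably also weigh how selective each query is, but even this crude ordering needs the sizes to be in the registry.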

CONTENT

  2. Added element for UCDs - this will be for dumb matching at first, can become more sophisticated or moved to a different level as UCDs become more sophisticated.

  3. Added spatial region Healpix - this is the CMB way of indexing the sky, added at the request of the Planck people. At the coarsest there are 12 regions.

  4. In a future iteration we should extend region of regard to the spectral and temporal regimes. NB I don't think this is the same as resolution in most cases; for source lists the error may be greater than the resolution (e.g. systematic errors due to reference source position uncertainty) or less (point source at good signal-to-noise); for images the spatial size of a single image is the same as the resolution for e.g. 1D radio spectra, but not for a radio synthesis or a CCD image.

  5. Added UNKNOWN to cframe types, spectral waveband coverage, might want this elsewhere as well. At present this is mainly because I do not know how to deal with solar data but it might be a useful general distinction between 'exists but unknown' v. 'NULL'.

  6. In a future iteration, add after object count coverage etc., the spatial fraction of the BOX etc. covered by images, and similarly for spectral and temporal coverage.

  7. Added Resolution (spatial spectral temporal)

  8. Added Data Quality (spatial spectral temporal)

Other future additions

SERVICE

  9. Added maximum image size allowed by service to the restrictions. There could be more, e.g. maximum time interval to search etc.
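For reference, the 12 coarsest Healpix regions mentioned in item 3 follow from the standard pixel count of 12 x nside^2. The short Python sketch below computes the number of regions and their area at a given resolution. The function names are illustrative, and restricting nside to powers of two (as the hierarchical numbering requires) is a choice made for this example.

```python
import math


def healpix_npix(nside):
    """Number of equal-area Healpix regions on the sky for a given nside."""
    if nside < 1 or nside & (nside - 1):
        raise ValueError("nside must be a positive power of two")
    return 12 * nside * nside


def pixel_area_sq_deg(nside):
    """Area of one region in square degrees (all regions have equal area)."""
    full_sky_sq_deg = 4 * math.pi * (180 / math.pi) ** 2
    return full_sky_sq_deg / healpix_npix(nside)
```

At nside = 1 this gives the 12 base regions, each a little over 3400 square degrees; each doubling of nside splits every region into four.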

best wishes

Anita

Received on 2003-06-17Z21:10:53