Issues relating to simple data and metadata queries

From: Doug Tody <dtody-at-nrao.edu>
Date: Sat, 28 Apr 2007 14:12:04 -0600 (MDT)


Hi All -

The TEG discussions relating to TAP appear to have stalled, which is unfortunate; it is hard to see how we can hope to make any progress as a group without discussion. I will make one more try, looking at the issues relating to the SQL92 information schema (metadata queries), as Alex suggested in our telecon yesterday. To try to provoke some discussion, I will identify some key issues where it would be good to have a consensus within the TEG, and expert advice from individual TEG members, in order to further advance the TAP design. It would help the TAP effort enormously if you could please voice your preferences or opinions for each of the items labeled ISSUE below.

If we can understand the requirements for querying table metadata then we understand most of what is required for a basic TAP interface, as the core functionality required is to query table data and metadata. Other issues such as data staging etc., are also important, but are required only for advanced services.

Database and Table Metadata


I think everyone here is probably already familiar with the information schema: it allows the standard SQL mechanism to be used to query database metadata as well as table data. Hence to get a description of the columns in table A, we can do something like SELECT * FROM INFORMATION_SCHEMA.columns WHERE TABLE_NAME="A". Similarly, one can query for a list of tables, views, stored procedures and their parameters, privileges, etc.

The Registry description of tabular resources provides a much simpler but similar mechanism, providing a way to describe a catalog of tables, or the columns of a table. The column metadata provided is similar to what VOTable defines for a table field.

Scope of Metadata


If we consider the types of sophisticated queries which will be made possible with ADQL, it is not clear how much metadata is needed to plan and compose complex queries.

ISSUE 1: To compose ADQL/SQL queries using TAP, do clients need

     information comparable to the SQL information schema (plus some
     additional astronomy-specific metadata), or is a simple list of
     the tables and columns managed by a TAP service (such as provided
     by the registry resource model) enough?  (Views for example, could
     be useful to describe what is in the condensed TAP verbose levels,
     and might be useful for viewing subsets of very large tables).

A related question is what the information provided by the registry is to be used for. As Ray noted in his mail yesterday, there are use cases which require sufficient information in the registry to enable discovery and planning of operations which use table services. One would like to have enough information stored in the registry to enable service discovery and perhaps some high level query planning (e.g., what type of information to extract and whether a given service provides it), before querying the individual services.

ISSUE 2: What information needs to be stored in (or otherwise available

     from) the registry, to support global TAP service discovery and
     perhaps the first phase of query planning?

ISSUE 3 (Related to #1-2): Is more database/table metadata than that
     required for #2 (discovery), needed to support advanced ADQL
     queries against a single service?


Metadata Content


The next concern is database and table metadata content as provided at the service level (independently of what is required at the registry level; whether these can be the same is TBD). This can be used for a variety of purposes, e.g., data discovery and selection, query planning, and detailed queries against a single table or set of tables.

ISSUE 4: What COLUMN metadata is required? Is a VOTable-like set of

     attributes sufficient to describe each column?  Is any additional
     SQL information schema metadata needed?

ISSUE 5: What TABLE metadata is required?  This can include
     table-specific metadata (table name, number of rows and columns,
     type or class of table, etc.), dataset metadata (DataID, Curation,
     Characterization, etc.), access metadata (e.g., to retrieve the
     entire table), and possibly other things.  We use generic dataset
     metadata to describe other types of data in the VO; how useful
     would it be to provide this for table datasets as well?


Form of Interface


The core TAP functions are to be able to query table data and table metadata. There are at least two possible approaches to this: a uniform interface for both, as for the SQL information schema (i.e., database metadata is represented as any other table), or separate interfaces, possibly with different inputs and outputs.

Option 1: Information Schema Approach

In the uniform interface approach the same mechanism is used to query both data tables and metadata tables. The information schema is used to give metadata tables predefined names, e.g., "INFORMATION_SCHEMA.tables".

For example (shortening this to just "SCHEMA" for brevity and ignoring protocol issues such as HTTP formatting, GET/POST usage, the details of MIME types, etc.), something like the following, could be a simple data query:

     queryData(query="SELECT * FROM A WHERE rms > 3.0", format="csv")
     queryData(table="A", rms="3.0/")

Using the same interface, something like the following could provide a simple metadata query mechanism:

     queryData(table="SCHEMA.tables")
     queryData(table="SCHEMA.columns", tableName="A")
     queryData(table="SCHEMA.views", tableName="A")

In the SQL information schema approach, a ADQL/SQL query against the information schema would also be possible, but may or may not be advisable in the TAP interface so a parameter-based example is given instead.

Option 2: Separate Interface for Data and Metadata

In this case we provide a separate interface to query or retrieve table data and metadata. For example,

     getTables()
     getColumns(table="A")

Even using a different operation with a different set of paramters, this approach could be made very similar to option 1, if the output semantics were the same in both cases.

ISSUE 6: Either of the above approaches would work. Option #1 can

     potentially provide access to a greater range of information
     via a uniform interface, with most of the client and server
     implementation being the same.  Option #2 can provide slightly
     simpler queries, and allows the acceptable parameter set to be
     constrained for each type of metadata query, but requires a new
     service operation for each type of metadata.  Which approach do
     you prefer, and why?  Or, suggest something different.
Received on 2007-04-28Z22:14:50