Re: Issues relating to simple data and metadata queries

From: Iņaki Ortiz de Landaluce <Inaki.Ortiz-at-sciops.esa.int>
Date: Mon, 30 Apr 2007 11:48:43 +0200


Hi All,

I will try to summarize our view on the three different issues: Table Metadata Access, Table Metadata Output Format and Relation Between Registry and Service. Here you can find a reference to what we previously sent as inputs regarding these issues: http://www.ivoa.net/forum/voql-teg/0703/0082.htm http://www.ivoa.net/forum/voql-teg/0704/0126.htm

  1. Table Metadata Access We feel the SQL92 Information Schema does not fit our needs for several reasons. First of all, it is not uniformly implemented by the different RDBMS vendors. SQL Server seems to be the only one that strictly follows the standard while for example Sybase uses a specific table naming convention and even specific commands ('list tables'). Moreover, here at ESAVO we use four different RDBMS: Sybase, Oracle, MySQL and PostgreSQL; and one ODBMS: Versant and you can see in the following page how different vendors describe the Database metadata: http://www.dbazine.com/db2/db2-disarticles/pelzer4 That is without talking about services not using RDBMS as their back-end (XML databases, ODBMS or VOTable/CSV/FITS flat files).

Regarding issues #4 and #5, here is a quick list of the table metadata we feel would be needed. For columns: column name, table name, utype, description, data type, array size, unit, ucd, values range. For tables: table name, description, primary key, number of rows, list of foreign keys. For example, utype, unit, ucd and value range are not provided by the Information Schema.

Because of all the above we think that the mechanism described in option 2 (issue #6) is necessary and sufficient, hence optimal, to retrieve this metadata.

2. Table Metadata Output Format
VOTable would fit for the Column description as all metadata could be included in the FIELD element, but the Table metadata does not fit in the VOTable document. Only Table Name and Description are accommodated within the document. As for using the DATA element to return the information, we would need one-to-n multiplicity to describe the foreign keys for table metadata.

As commented in issue #1, we could use Registry VOResource schema elements to return this metadata with the slight inconvenience of adding a dependency between the two specifications, but the big advantage of reusing an existing IVOA XML Schema. Other option would be to create another specific XML Schema. Either of these two options are fine with us.

3. Relation Between Registry and Service (Issue #2) We think that the Registry should be the source for most of the information metadata and not simply a cache/proxy for it. We consider three types of metadata. First, Resource Metadata (Name, Description, Contact, Subject ...etc), second Service Metadata (URL endpoint, Service Type, Max Records ...etc) and finally Table Metadata (see comment above). We think that both Resource and Service Metadata should originate from the Registry. There is no need to return Resource Metadata from the service (it may be optional). Service Metadata could be useful straight from the service and would be sufficient to make it self-contained. And finally, Table Metadata must be accessible from the service and might be cached in fine-grained Registries.

Finally, for reference please find attached the TAP proposal we sent before the last telecon. Please feel free to comment on it as well.

Cheers



This message and any attachments are intended for the use of the addressee or addressees only. The unauthorised disclosure, use, dissemination or copying (either in whole or in part) of its content is prohibited. If you received this message in error, please delete it from your system and notify the sender. E-mails can be altered and their integrity cannot be guaranteed. ESA shall not be liable for any e-mail if modified.

Inputs folow for TAP draft doc. Comments are given within <--//comment//-->.

3 Input Query Formats
In this protocol specification two input query formats are defined. One aimed to be easy both to implement and use, the other to give complete access to the dataset. Either methods may be implemented or both. As can be seen below, the definition of those query methods do not include the definition of the output formats nor the error handling. These are shared by the two methods and defined hereafter.

3.1 Simple Access Query
The Simple Access Query method is aimed at being easy to implement both for the server and the client. It gives access to a dataset using simple Key Value Pairs to filter the data to be returned, but allows only minimal control over the quantity of information or the quantity of data to be returned. This access method also presents the dataset as a flat, single table structure and effectively hides the inner dataset structure. Hence, usually only the main information on the data may be accessed and clients/users easily reach the limits of the interface. Nevertheless, it is an easy and fast interface to implement and may be a quick access to the data while a more complete interface is implemented if needed.

3.1.1 Query method format
The Simple Access Query method is invoked via a standard HTTP/1.1 GET request to an URL endpoint. The URL endpoint takes the following general form:

http://service.endpoint.saq{?,&}REQUEST=queryData&PARAM=value&PARAM=value

Note that the specific service endpoint may include login information (e.g., “user:pass@”), port number (e.g., “:8080”) and an arbitrary long path. Anchor pointers (e.g., “#anchor”) may be present in the URL endpoint, but they have no specific meaning in this protocol specification for the service.

The endpoint/query params separator may be either a question mark or an ampersand sign depending on the presence or not of a question mark in the specific service endpoint.

The parameters shall be seen as filters for the data to be returned. If no parameters are provided, the service shall return all of its data, within its maximum returned values limit and providing its default set of columns.

If various parameters are provided, they shall act on the data as successive filters applied in the order they are specified in the URL endpoint. Only data that passes all of the supported filters shall be returned.

If a parameter is unsupported by the service, it may be reported as an error or silently ignored. Of course, parameters specified as compulsory shall not be silently ignored and will deem the service as non compliant with the specification.

Parameter names are usually defined and written in uppercase, but they shall be treated in a case-insensitive manner by the service. A good habit followed by most services is to write protocol defined parameters in uppercase and service specific ones in lowercase.

The parameters should not modify the output format of the service (see point 4 for this), they may only limit the quantity of data (how many rows), the type of data (specific values filtered) and the quantity of information (how many columns).

The service may also support this query method using the HTTP POST verb, but this should not be considered standard behavior by clients. The other HTTP verbs (PUT, DELETE, OPTIONS, HEAD, …) are undefined for this query method.

3.1.2 Generic PARAM types
Here we define the generic PARAM types that can be used to filter the data. This should allow implementers to write generic code for input parameters. Then, a few reserved PARAM names are specified, for which additional logic might be needed. Services or protocols that expand on this specification may define additional PARAMs, but they should be based on the following generic types.

3.1.2.1 The single value type
This is the simplest PARAM type. The parameter name points at a single data type (one column) and it must be compared to the value specified using an equal sign operator. The input looks like this: PARAM=value The value can be either a numerical type or a character string type. Hence, for numerical types it would result in the following statement: PARAM_col = value … and for character string types: PARAM_col = "value"

The definition of the parameter may specify that it shall be treated as an upper or lower limit. This is only valid if used with a numerical type value. Hence, for upper limit parameters the statement is: PARAM_col > value … and for lower limit parameters: PARAM_col < value

3.1.2.2 The list value type
The parameter name also points at a single data type (one column), but the value specified is actually a comma separated list of values. The input looks like this: PARAM=value1,value2,value3 The value can be either a numerical type or a character string type. Hence, for numerical types it would result in the following statement: PARAM_col IN (value1,value2,value3)
… and for character string types:
PARAM_col IN ("value1","value2","value3")

For character string types, the list may begin or end by a comma, or have two consecutive commas. This shall be interpreted as the empty string value and must be included in the list of values. For numerical types, this shall not be interpreted as the zero value and may be reported as an error by the service or silently ignored.

3.1.2.3 The interval value type
Here the parameter name still points at a single data type (one column), but the value specified is actually a slash separated interval. The input looks like this: PARAM=valueMIN/valueMAX The value can only be a numerical type. The interval may be open at one end (PARAM=/valueMAX or PARAM=valueMIN/), but it has of course no meaning to open at both end (this is equal to not filtering on this value). This kind of input (PARAM=/) may be reported as an error by the service or silently ignored.

Hence, the resulting statement may be, as appropriate:

PARAM_col BETWEEN valueMIN AND valueMAX
PARAM_col > valueMIN (open upper limit interval)
PARAM_col < valueMAX (open lower limit interval)

3.1.2.4 The interval PARAM type
In this case, the parameter name points to an interval of values (two columns). The definition of the parameter shall specify this, as no indication in the parameter name is given. Also, this parameter type can only be used with numerical type values.

If the value is of the single type, then the filter consist of checking the value is inside the interval. The resulting statement is: PARAM_colMIN < value AND PARAM_colMAX > value If the value is of the interval type, then the filter consists of checking the two intervals cover each other. The resulting statement is one of:

PARAM_colMIN < valueMAX OR PARAM_colMAX > valueMIN
PARAM_colMAX > valueMIN (open upper limit interval)
PARAM_colMIN < valueMAX (open lower limit interval)

3.1.3 Reserved PARAM names
The following list of parameters is by no mean compulsory for services supporting the Simple Access Query method, but if they are needed (or rendered compulsory by a service or protocol definition that expands on this one) they must comply with the following definitions. It is strongly discouraged to implement a similar comportment with any other PARAM name.

Define here POS & SIZE, BAND, TIME, FORMAT, VERB, TOP, TOKEN parameters

3.2 Complete Access Query
The Complete Access Query method is aimed at giving total access and control over the dataset to the client. It uses a full query language (defined in other specifications) to access the data and hence offers much more control than the Simple Access Query method. This access method also presents the dataset inner structure. Hence, all of the information on the data may be returned, but the clients/users are responsible of such things as table joins, order by or selecting the output information.

3.2.1 Query method format
The Complete Access Query method is invoked via a standard HTTP/1.1 POST request to an URL endpoint. The URL endpoint takes the following general form:

http://service.endpoint.caq

Note that the specific service endpoint may include login information (e.g., “user:pass@”), port number (e.g., “:8080”) and an arbitrary long path. Anchor pointers (e.g., “#anchor”) may be present in the URL endpoint, but they have no specific meaning in this protocol specification for the service.

The actual parameter is passed in the message body in the following form: queryType=queryString

Note that it is an error, and it must be reported as such, to pass any other parameter in the message body than the ones defined hereafter or to specify more than one parameter or no parameter at all.

The service may support any number of the following query types. Trying to make a query using an unsupported query type shall result in an error output. Parameter names are usually written as defined hereafter, but they shall be treated in a case-insensitive manner by the service.

The service may also support this query method using the HTTP GET verb, but this is not encouraged as the use of a full query language makes it hard to guaranty the safety of the operation. The other HTTP verbs (PUT, DELETE, OPTIONS, HEAD, …) are undefined for this query method.

3.2.2 The nativeADQL method
This method takes an ADQL string-formatted parameter in the form: nativeADQL=SELECT * FROM table WHERE ?

3.2.3 The uTypeADQL method
This method takes an ADQL string-formatted parameter in the form: uTypeADQL=SELECT * WHERE ?

3.2.4 The directSQL method
This method takes a standard SQL parameter in the form: directSQL=SELECT * FROM table WHERE ?

Any standard or even vendor specific definition of SQL may be used here. This method should be understood as a pass-thru to the database and should be implemented and used with extreme caution.

4 Output Result Formats
<--//
Hereafter is briefly exposed a new idea for output result formats handling. The advantage of this solution being that the exact same mechanism is used whatever input query method was used. The disadvantage being that the output format requested must be in MIME type format. The other solution would be a DISPLAY parameter just as some services already use.

NOTE: we are not here talking about the FORMAT parameter used in S*AP protocols.

A similar comment could be made in this specification concerning compression using the standard Accept-Encoding and Content-Encoding headers of HTTP.

A similar, but less important, comment could be made in this specification concerning charset encoding using the standard Accept-Charset header of HTTP. //-->

The output of the results can take various formats and does not depend on the input query method used to call the service. The standard format, which must be supported by all services, is the VOTable 1.1 format as described below. Additionally to this one, services may support any number of other formats.

The output format desired by the client shall be indicated using the standard HTTP mechanism and MIME types. As such, the client uses the “Accept:” HTTP header to indicate a list of preferred MIME types for the output.

For example: Accept: application/x-votable+xml, text/csv, */*

And the service must respond to a valid query that does not generate any error with a “200 OK” status code and the “Content-Type:” HTTP header to indicate the MIME type of the actual output.

For example: HTTP/1.1 200 OK Content-Type: application/x-votable+xml

If the “Accept:” header is missing from the client call, the default output format is the VOTable 1.1 format as described below. The “Content-Type:” response header shall still be filled out accordingly.

<--//
Here is briefly exposed a new idea for empty output results handling. The advantage of this solution being that the exact same mechanism is used whatever result format was requested and processing power both on the server and the client is spared as well as bandwidth. The disadvantage being that the current clients probably don’t check for this status flag. The other solution would be to return an empty VOTable (or other output format) just as is done right now. //-->

If no data would be returned as a result to the client query, the server should respond with a “204 No Content” status code and no message body. This behavior is preferred to returning an empty VOTable (or other output format with no data). Note that this is not an error status code. The query must be valid and executed without error, but no data resulting from it.

5 Metadata Access Format
Metadata is accessed directly from the service endpoint and should contain all the information needed to construct simple or complete queries. The metadata query methods for an input query method must be supported if the input query method is supported by the service.

5.1 Metadata query methods
The Metadata Query methods are invoked via a standard HTTP/1.1 GET request to an URL endpoint. The URL endpoint takes the following general form:

http://service.endpoint.{s,c}aq{?,&}REQUEST=getCapabilities[&table=table_name]

Note that the specific service endpoint may include login information (e.g., “user:pass@”), port number (e.g., “:8080”) and an arbitrary long path. Anchor pointers (e.g., “#anchor”) may be present in the URL endpoint, but they have no specific meaning in this protocol specification for the service.

The endpoint/query params separator may be either a question mark or an ampersand sign depending on the presence or not of a question mark in the specific service endpoint.

The other HTTP verbs (POST, PUT, DELETE, OPTIONS, HEAD, …) are undefined for this query method.

5.1.1 Simple query metadata
Accessing a Simple Query method metadata is done via two unique endpoints.

http://service.endpoint.saq{?,&}REQUEST=getCapabilities This shall return the information on input parameters that can be used for this service, encoded in the XML “params” structure described hereafter.

http://service.endpoint.saq{?,&}REQUEST=getCapabilities&table=results This shall return the information on the data returned by this service, encoded in the XML “table” structure described hereafter.

5.1.2 Complete query metadata
Accessing a Complete Query method metadata is done via two unique endpoints. The second endpoint has a variable part informed by a call to the first one.

http://service.endpoint.caq{?,&}REQUEST=getCapabilities This shall return the information on the database structure of this service, encoded in the XML “dataset” structure described hereafter.

http://service.endpoint.caq{?,&}REQUEST=getCapabilities&table=table_name This shall return the information on the particular structure of the table “table_name” of this service, encoded in the XML “table” structure described hereafter.

If the table “table_name” does not exist for this service, the server shall return an error code “400 Bad Request” or “404 Not Found”.

5.2 Metadata output formats
All metadata output shall conform to the following formats and validate against its respective XSD schema that can be found in appendix. TODO (to be worked out in coordination with the Registry WG)

6 Error Responses
<--//
Hereafter is briefly exposed a new idea for error handling.The advantage of this solution being that the exact same format of error response is used whatever result format was requested.The disadvantage being that the list of error codes cannot be extended.The other solution would be a VOTable error output just as in the SSAP specification. //-->

Error output shall be supported using the standard error codes from the HTTP protocol. Here is a list of correspondence and explanations of the different codes.

Here are the client error codes:

Here are the server error codes:

It is strongly encouraged to use the message body to return a textual explanation of the error for the client/user. The description should be a plain UTF-8 encoded English text without any formatting tags destined to a human being. Received on 2007-04-30Z11:48:57