Strawman TAP Protocol
D.Tody, April 2007

The following is a high-level analysis and proposed concept for the TAP protocol, approaching the problem from the point of view of consistency with the other DAL protocols, with the intention that ultimately TAP, SSAP, SIAP (v2), etc., will form a consistent family of data access protocols.

What we have done is start from the basic DAL service profile (set of operations) and see how far we get in applying this to table data. The TAP service profile can vary from the standard DAL service profile if necessary, but ideally we would like to have as uniform a service profile as feasible for all the DAL services, e.g., to provide common semantics, and allow code to be shared within implementations of multiple DAL services.

In addition to the service profile, so far as possible, TAP and the other DAL services should also use the HTTP protocol consistently. Hence the details of how we specify service endpoints, multiple service operations, parameters and their arguments, error responses, etc., should be consistent. Although it is pretty far along in the process to change SSAP, if there are any remaining issues we would like to identify them ASAP as the final draft V1.0 SSAP interface is still being prepared. Ultimately we would like all the second generation data access interfaces to be uniform in approach, so far as possible.

My analysis of mapping the table access problem onto the DAL service profile took the following goals or issues into account:

o Start with the standard semantics for each service operation and see how well they work for table access.
o Need to be able to query table data.
o Need to be able to query table metadata.
o Need to be able to query service metadata.
o At the most basic level, a table query should not be much more complicated than a cone search is now. Ideally even a simple ADQL-based query could be almost this simple (e.g., we just add an ADQL= parameter whose value is a URL-encoded ADQL expression).
o Ultimately, we need to be able to handle long-running asynchronous operations (large queries, VOSpace linkage, etc.), much as for SIA V2 and eventually other DAL interfaces.

This analysis follows on from a discussion a subgroup of us held at ESAC during the Spectroscopy in VO workshop in late March.


Basic DAL Service Profile
--------------------------

Here is a more detailed summary of our current standard service profile, including the usual semantics of the operations.

queryData
    Find data which satisfies the given query parameters.
    Returns standard metadata describing available datasets.
    Can describe virtual data which may be generated at access time.
    Used for both data discovery, and to negotiate with the service on the details of any virtual data to be generated.
    Provides an access reference which can be used to retrieve a dataset.
    Access reference can also be used as input to stageData.
    This operation is synchronous, repeatable, idempotent, scalable.
    Output is tabular (each row describes a single dataset).
    Output conforms to a standard model, formatted as a VOTable.

getData
    Used to retrieve a previously described dataset.
    Only a single dataset is returned at a time.
    Data may be generated on the fly at access time.
    This operation is synchronous, repeatable, idempotent, scalable.
    Output is a single dataset, in a variety of data formats.
    Large datasets may be returned in a streaming transfer.

stageData
    Used to initiate production and/or transfer of data products.
    Available data products are described in a previous call to queryData.
    Processing is initiated by the call, and proceeds asynchronously.
    Data products are usually virtual data, computed on the fly.
    Multiple datasets may be processed in a single job.
    This operation is asynchronous and changes the service state, initiating an asynchronous job which runs on the server.
    Output is one or more datasets, in a variety of data formats.
    Data may be "staged" on the server and subsequently retrieved with a standard, synchronous getData call (the client is notified when the data is ready to be retrieved).
    Data may also be returned directly by the service, e.g., by delivering the dataset directly to a designated VOSpace.
    Polling and/or messaging may be used to track job execution.
    (This is being developed in collaboration with the G&WS WG)

getCapabilities
    Used to query the service capabilities, including service version, service functionality (capabilities and limitations), interface, available outputs, available data collections, coverage.
    Only service metadata is returned - not dataset metadata.
    (This is being developed in collaboration with the Registry WG)
    This operation is synchronous, repeatable, idempotent, scalable.
    Output is heterogeneous, and is returned as structured XML.
    Service metadata may be cached in the registry, hence compatibility with the registry is desirable, so that the same client interface can be used in both cases.

getAvailability
    Used to verify that the service is up and running.
    The VO "grid" periodically queries each service to see if it is up and running.
    Details are TBD (but this one looks very simple).
    (This is being developed in collaboration with the G&WS WG)

In typical implementations, almost all of this can be shared by different types of DAL services. Operations such as stageData, getCapabilities, and getAvailability can possibly be completely shared (although the metadata returned may vary). Much of queryData can be shared; the main differences are the metadata returned and the request parameters, which may vary. Many things, e.g., the error response mechanism, parameter handling, etc., can be the same for all service classes.
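
To make the code-sharing point concrete, here is a minimal Python sketch of how the five operations might be laid out in an implementation. Everything in it is an assumption made for illustration: the class names, the method names, the dictionary-shaped responses, and the error format are hypothetical and are not part of any DAL specification. It shows only the shape of the idea: shared parameter dispatch, a shared error response, and shared default operations live in a base class, while queryData and getData are supplied per service class.

```python
from abc import ABC, abstractmethod


class DALService(ABC):
    """Hypothetical base class: one method per operation in the profile."""

    def dispatch(self, request, params):
        """Shared parameter handling: route an operation name to its method."""
        operations = {
            "queryData": self.query_data,
            "getData": self.get_data,
            "stageData": self.stage_data,
            "getCapabilities": self.get_capabilities,
            "getAvailability": self.get_availability,
        }
        if request not in operations:
            # Shared error response mechanism (this format is an assumption).
            return {"status": "ERROR", "message": "unknown operation: " + request}
        return operations[request](params)

    # Operations whose details differ between service classes.
    @abstractmethod
    def query_data(self, params):
        ...

    @abstractmethod
    def get_data(self, params):
        ...

    # Operations that could possibly be shared by all service classes.
    def stage_data(self, params):
        return {"status": "OK", "job": "queued", "acrefs": params.get("acrefs", [])}

    def get_capabilities(self, params):
        return {"status": "OK", "service": type(self).__name__}

    def get_availability(self, params):
        return {"status": "OK", "available": True}


class TableService(DALService):
    """Hypothetical table service: only the data-specific parts are written."""

    def query_data(self, params):
        return {"status": "OK", "rows": []}

    def get_data(self, params):
        return {"status": "OK", "acref": params.get("acref")}
```

A client-side call such as `TableService().dispatch("getAvailability", {})` would then run code that an image or spectrum service could reuse unchanged.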


Proposed TAP Service Profile
-----------------------------

In comparing the generic service profile to what is required for TAP I find that it pretty much works; the most notable difference is that queryData is more complex than for the other classes of data, due to the need to return more complex dataset/table metadata, and the need to support ADQL expressions. Essentially everything else works with no important changes in concept or function. We could split some aspects of queryData off into multiple operations, but it is not clear that this is warranted; a uniform interface is desirable for the client, and will increase code re-use in implementations. What we end up with is quite a bit different than cone search or SkyNode, but probably that is inevitable no matter what we do.

queryData
    I suggest that this should be the main service operation for both data and metadata queries, providing a uniform interface for both. We could support a registry-compatible XML format as an optional output format for table metadata, but the default metadata output format would be VOTable as for table data queries. Queries can be either parameter-based, or ADQL-based, with the output being the same in both cases.

    There are two types of queries against data tables. In an "immediate mode" query, table data is returned directly in the query response (as for a cone search). The client submits the query against the data table and gets back a response; nothing further is required (some have suggested that this be added for services like SIA as well, but it is better defined here since there TOP ranking is required).

    The second type of data table query does not return table data directly, but instead returns *dataset metadata* for the virtual dataset (table) referred to, including a conventional DAL *access reference* which can be used to retrieve the table. This is a direct analogue to SIAP, SSAP, etc.
    There are several reasons for suggesting this:

    - This provides a way to return generic dataset metadata such as general Table metadata (e.g., number of table rows and columns), plus non-table specific general dataset metadata such as DataID, Char, etc. General dataset metadata is probably just as useful for tables as it is for other forms of data, e.g., for automated data selection.

    - In a query against a service which provides access to many tables (an advanced option only provided by more sophisticated services), this approach would provide a discovery mechanism, which can return uniform metadata describing each (possibly virtual) table. The acref could be used to express a subset ("cutout" or selection) of the reference catalog based on the input to the queryData operation (e.g., POS/SIZE, BAND, TIME, etc.). As for other DAL services, what is in the acref is up to the service.

    - The use of an acref to "tag" potential table-generation jobs provides a mechanism to plan large queries. The query would include an estimate of the number of rows to be returned, the size of the output table in bytes, and possibly the execution time required to perform the operation. The client could use this information to further refine the query, ultimately issuing a job request to the service with the stageData operation, identifying the "job" by its access reference.

    I don't see the use of an acref as being all that much different than, e.g., an acref which references a cutout of an image. For small tables the acref could return an "archival" (entire) table. For larger tables the acref could contain a query, but that is similar to what is done for an image cutout. Note that arbitrarily large datasets can be returned via a streaming (synchronous) getData, so long as it gets underway before timeout occurs.
    NOTE, in the simplest mode of usage, no acref is required, and we can have an "immediate" query against a single table, as for a cone search, except that model- or ADQL-based queries are possible.

    To access table metadata we have two types of queries. One is what is described above: this returns generic dataset metadata describing the output table (which may be virtual). The second kind queries table metadata directly for a specific table. The output is a table describing the columns of the table, or possibly other table metadata. This is just like a data table query; we simply query a different logical "table", and the data model of the returned table metadata is standardized and defined in advance.

    The basic queryData operation, posed against the entire "table set" managed by the service, already provides a means to list the tables supported by a service, providing uniform table metadata for each.

    queryData would normally be a GET, but a POST option could be provided for cases where the ADQL expression is large, or references auxiliary data such as a region mask or an uploaded user table (e.g., a user-supplied list of positions).

getData
    This is as for any other DAL service, except of course in the details of how the output dataset is generated. Note that a separate getData is NOT required for a simple "immediate mode" invocation of queryData.

    In a simple case, such as accessing an entire small table, getData is trivial. The acref points to the table dataset, and it is returned in the desired output format.

    In the case of "staged" data, or streaming transfer of large tables, the acref would return the data (the service has to remind itself somehow what the acref refers to), or it would return an error, e.g., if staged data is not yet available.
    If the server caches staged data locally, the output table is a large file, and can be returned via a simple getData operation once it is available on the server (how the client is notified when the data is available, and when the data is deleted, are details of how stageData works).

stageData
    We think this is essentially the same for all DAL operations, as this does not depend upon the type of data being accessed. The stageData operation (probably implemented as a POST) commands the server to generate one or more datasets (e.g., perform a large query), and either cache the data locally, or deliver it to a designated VOSpace. Polling and/or streaming HTTP messages can be used to monitor the progress of a job. If VOSpace is not used, a conventional getData can be used to retrieve a dataset once it is available.

    Specification of stageData is being coordinated with the GWS working group.

getCapabilities
    As for any other DAL service. Describes the service capabilities, in XML, compatible with a Registry VOResource. The service metadata is specific to TAP and the features of ADQL (or XMatch service etc.) supported by the given service instance.

    Specification of getCapabilities is being coordinated with the Registry working group.

getAvailability
    As for any other DAL service. There do not appear to be any TAP dependencies.

    Specification of getAvailability is being coordinated with the GWS working group.


Discussion / Issues
-------------------

Our approach here considers *service metadata* and *dataset metadata* to be two different things. Service metadata describes an individual service, including its interface and capabilities (optional features implemented, max size query region or response, etc.), and also identifies the data collections available from the service. Detailed information on data collections is provided separately by the registry. Dataset metadata describes an individual, real or virtual dataset.
In the case of TAP, a Table is a single primary "dataset", like an Image or Spectrum. The getCapabilities operation is used only to return service metadata. Dataset metadata is returned separately by the queryData operation.

Having a uniform interface to query both table data and table metadata, both of which are returned as tables, is highly desirable from the point of view of a client application, and for implementation of client software, as the same interface can be used for both. At the service level this provides a simple, uniform interface, as the same interface can be used for both table data and metadata. It is also consistent with the relational model upon which TAP is based. Whether metadata queries should be limited to simple fixed queries expressed by a parameter (e.g., return a description of all table fields), or should permit use of the query language as well, is TBD.

Since in this proposal a uniform query interface is used for both table data and metadata, the default format for returning table metadata is VOTable. However, queryData can support a range of output formats, for example including CSV and XML. The XML output option could allow table metadata to be returned in a format which is compatible with a tabular resource in the registry.

Table data queries can be based either on a standard data model, or can be posed directly in terms of the fields of a specific table (another option might be to pose the query in terms of UCDs used to indirectly identify table fields). In the case of a data model-based query, the details of this data model are still TBD. A reasonable starting point is the generic parameters identified for dataset queries in the SSAP interface, e.g., POS/SIZE, BAND, TIME, and so forth. For many source catalogs derived from image data, these can provide a reasonable starting point, especially if multiple catalogs are queried, where selection based on spectral or time coverage would be useful.
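
As an illustration of how close the two query styles could be on the wire, the following Python sketch builds a GET URL for a parameter-based query and for an ADQL-based query. It is only a sketch under stated assumptions: the base URL, the REQUEST parameter used to select the operation, the table name, and the SQL-like expression are hypothetical, and only the POS, SIZE, and ADQL parameter names come from the discussion above.

```python
from urllib.parse import urlencode


def build_query_url(base_url, adql=None, **params):
    """Build a queryData GET URL (parameter layout is an assumption)."""
    # REQUEST=queryData is a hypothetical way of selecting the operation.
    query = {"REQUEST": "queryData"}
    # Model-based parameters such as POS, SIZE, BAND, TIME.
    query.update(params)
    if adql is not None:
        # urlencode() performs the URL encoding of the ADQL expression.
        query["ADQL"] = adql
    return base_url + "?" + urlencode(query)


# Parameter-based query, much like a cone search.
cone_like = build_query_url("http://example.org/tap", POS="180.0,-30.0", SIZE="0.5")

# ADQL-based query against a hypothetical table named "sources".
adql_based = build_query_url(
    "http://example.org/tap",
    adql="SELECT * FROM sources WHERE mag < 15",
)
```

Both calls go to the same endpoint and differ only in their parameters, which is the uniformity the proposal is aiming for.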
At first glance, use of access refs for table data may seem odd; however, this is not required for simple table access. A basic queryData in "immediate" mode is like a cone search, but with the possibility of additional query parameters or use of ADQL. Table data is returned directly in a single operation, as for cone search.

A conventional access reference can however be useful for discovery and retrieval of entire static tables as "archival" datasets; a non-immediate queryData can also be used to return uniform metadata for table datasets (including DataID, Curation, Characterization, etc.). In addition, some such mechanism is required to identify and "tag" potential batch jobs for large queries. The query response would estimate the resources required to execute such a large query, and would return this in the output "dataset" metadata, along with an access reference which could be input via stageData to initiate the large query as an asynchronous job. This would provide a means to both estimate large queries, and execute such a query once the client has determined what it wants to do.
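
The estimate-then-execute pattern can be sketched from the client side as follows. This is a hedged illustration only: the response is modelled as a plain dictionary, and the field names `est_rows` and `acref`, the function name, and the row threshold are all hypothetical. A real response would be dataset metadata in a VOTable, with field names still to be defined.

```python
def plan_query(response, max_sync_rows=10000):
    """Pick an operation from a non-immediate queryData response.

    `response` stands in for the dataset metadata returned by queryData;
    its keys are assumptions made for this sketch.
    """
    acref = response["acref"]
    if response["est_rows"] <= max_sync_rows:
        # Small result: retrieve it directly with a synchronous getData.
        return ("getData", acref)
    # Large result: submit the acref to stageData as an asynchronous job.
    return ("stageData", acref)


small = plan_query({"est_rows": 250, "acref": "acref-small"})
large = plan_query({"est_rows": 50000000, "acref": "acref-large"})
```

The threshold is a client-side policy choice in this sketch; a client could equally refine the query and ask for a new estimate before committing to a job.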