Asynchronous querying and tabular data

From: Kona Andrews <kea-at-roe.ac.uk>
Date: Tue, 1 May 2007 16:18:00 +0100


Dear all,

Copied below is a useful discussion from a colleague of why access protocols like SIAP and SSAP don't extend so gracefully to large tabular data queries, and why therefore we shouldn't try to make TAP exactly conform to the model assumed by these protocols.

Cheers,
Kona


If I understand the DAL model correctly, an evolved DAL service, e.g. SIAP 2, has generically three parts: a synchronous query, a data-staging operation, and an event stream.

The synchronous query can return, depending on the arguments and the type of service, a data-set (as in Cone Search); or a table listing available data-sets (SIAP, SSAP); or metadata describing the service (all service types, but the metadata content varies between them).

The data-staging operation can be applied separately to the virtual data-set described in each row of the query results. Each such staging is a separate, asynchronous job. My understanding is that the data staging operation controls the production of data; the data staging URL doesn't supply the data stream. During data staging, the requested data accumulate on the server and have to be downloaded later.

The event stream is a way of monitoring the data staging without polling. The data staging itself is supposed to support polling for progress, so the event stream is an optional feature for the client.
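The three parts described above can be mimicked with a small in-memory toy. This is only a sketch of the shape of the interaction, not of any real SIAP or SSAP interface: the class names, the `access_ref` values and the fixed two "steps" of staging work are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StagingJob:
    """One asynchronous staging job for one row of the query results (hypothetical)."""
    access_ref: str
    steps_left: int  # pretend amount of server-side work remaining
    listeners: List[Callable[[str, str], None]] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return self.steps_left == 0

    def tick(self) -> None:
        """Advance the server-side work by one step; emit an event on completion."""
        if self.steps_left > 0:
            self.steps_left -= 1
            if self.done:
                for notify in self.listeners:
                    notify(self.access_ref, "STAGED")


class ToyDalService:
    """In-memory stand-in for a DAL-style service (an assumption, not a real API)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, StagingJob] = {}

    def query(self, pos: str) -> List[dict]:
        """Part 1: the synchronous query returns a short table of data-sets."""
        return [{"title": f"image {i} near {pos}", "access_ref": f"dataset/{i}"}
                for i in range(3)]

    def stage(self, access_ref: str) -> StagingJob:
        """Part 2: staging is a separate asynchronous job for each row."""
        job = StagingJob(access_ref, steps_left=2)
        self.jobs[access_ref] = job
        return job

    def run_one_step(self) -> None:
        """Simulate the server making progress on every staging job."""
        for job in self.jobs.values():
            job.tick()


service = ToyDalService()
rows = service.query("180.0,-30.0")
jobs = [service.stage(row["access_ref"]) for row in rows]

# Part 3: the event stream is optional; here only the first job is subscribed to.
events = []
jobs[0].listeners.append(lambda ref, state: events.append((ref, state)))

# The polling route works for every job, with or without the event stream.
polls = 0
while not all(job.done for job in jobs):
    service.run_one_step()
    polls += 1

print(polls, events)
```

Note that all the lengthy work sits in `stage`, while `query` returns at once; that is the disposition the following paragraphs question.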

This disposition assumes that the query is quick, because the catalogue queried is a simple, short list of data sets. Any lengthy work is done in the data-staging operation. That's a reasonable assumption for SIAP and SSAP, where the image/spectrum catalogue is usually short. It's less safe an assumption for cone search, unless the service restricts the search area. It's an extremely poor assumption for TAP where queries are expensive and data staging less of a problem. (For applications in general, it's a poor assumption since the "query" may not have any scientific meaning. E.g. in extracting a catalogue from an image, where is the "query"?)

I note that, for any service protocol, it's possible to artificially divide it into a query and data staging, and to do the actual computation in the data staging. For a data-processing application, this might mean that the query returns a list of locations of results files but those files are not computed until the data staging; but the results of a computation are not usually independent, so staging one implies staging all of them. For TAP, the "query" might return a table listing a single data set with an access URI of the form http://whatever/tap/stageData?ADQL=...; i.e. the real query is done during the "data staging"; but this is forced and rather silly, and it requires an extra HTTP operation plus the parsing of an extra VOTable to do something rather simple.
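To make the forced TAP arrangement concrete, here is a rough sketch of what the "query" and the "data staging" would each have to do. The host `whatever` and the `stageData` endpoint are the placeholders from the paragraph above; the function names, the sample ADQL and the step lists are assumptions made up for this illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://whatever/tap"  # placeholder host, as in the text


def forced_query(adql: str) -> list:
    """The 'query' does no real work: it wraps the ADQL in a staging URI
    and returns a one-row table pointing at it."""
    return [{"access_ref": f"{BASE}/stageData?{urlencode({'ADQL': adql})}"}]


def adql_from_access_ref(access_ref: str) -> str:
    """What the staging endpoint must then do: unpack the real query again."""
    return parse_qs(urlsplit(access_ref).query)["ADQL"][0]


adql = "SELECT ra, dec FROM catalogue WHERE mag < 20"
rows = forced_query(adql)

# Client-side steps in each arrangement (illustrative bookkeeping only).
forced_steps = [
    "HTTP: send query",
    "parse VOTable listing one data set",
    "HTTP: request access_ref (the real query runs here)",
    "parse VOTable of results",
]
direct_steps = [
    "HTTP: send query",
    "parse VOTable of results",
]

print(rows[0]["access_ref"])
print(adql_from_access_ref(rows[0]["access_ref"]) == adql)
print(len(forced_steps) - len(direct_steps), "extra client steps")
```

The ADQL survives the round trip unchanged, which is the point: the intermediate table carries no information the client did not already have.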

In any of these DAL-like services, the results can be read with or without data staging; their access references exist as soon as the query completes. There doesn't seem to be a way for the client to tell whether the results need staging. Presumably the client gets an HTTP 404 if it tries to read, straight away, a data set that should have been staged first.

The results of the synchronous query are assumed to be small enough to return to the client via the control connection; cf. OGSA-DAI and DSA/Catalogue. This is a reasonable assumption for SSAP and SIAP; a weak assumption for Cone; and a broken assumption for TAP. Delivery of results to third parties is possible if those results are linked from the query results but not possible when the ultimate results ARE the immediate results of the query. Streaming delivery to a third party - where the results are not cached in the originating service - is possible in principle, but only by means of the recipient reading the results URI and waiting; results cannot be pushed. This approach fails if the results do not flow steadily; that would risk a time-out in reading the URL. Therefore, streaming delivery done this way does not fit well with the data staging.
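The time-out risk for pull-based streaming can be illustrated with simple arithmetic. The sketch below assumes a reader that gives up after a fixed period of silence; the function name, the 30-second time-out and the arrival times are all invented for the example and do not describe any particular HTTP client.

```python
from typing import Iterable, Optional


def first_timeout(arrival_times: Iterable[float], read_timeout: float) -> Optional[float]:
    """Return the time at which a pulling reader gives up, or None if it never does.

    arrival_times: seconds after the read starts at which each chunk of
                   results becomes available on the results URI.
    read_timeout:  longest silence the reader tolerates between chunks.
    """
    last = 0.0
    for t in sorted(arrival_times):
        if t - last > read_timeout:
            return last + read_timeout  # the reader abandons the connection here
        last = t
    return None


steady = [1, 2, 3, 4, 5]   # results flow continuously
bursty = [1, 2, 40, 41]    # the server stalls for 38 s while it works

print(first_timeout(steady, 30.0))  # None
print(first_timeout(bursty, 30.0))  # 32.0
```

In the bursty case the reader drops the connection at 32 s, eight seconds before the next chunk would have arrived, so the remaining results are never delivered.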

I suggest that in TAP, and in applications in general, we usually have an initial, atomic unit of work that takes an arbitrary amount of time. Once this work is completed, there exist results data-sets that can be immediately downloaded. The initial work needs to be controlled asynchronously, and the downloading of results can be synchronous as the data flow continuously. A synchronous query with asynchronous data-staging is exactly the opposite of what is needed.
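A minimal sketch of the arrangement suggested here: the query itself is the asynchronous job, and the download is a plain synchronous read once the job has finished. The `QueryJob` class and its methods are hypothetical, and the phase names are borrowed from the UWS pattern as I understand it, so treat them as an assumption rather than as a statement of what any specification requires.

```python
import threading
import time
from enum import Enum


class Phase(Enum):
    # Phase names assumed from the UWS pattern; not a complete list.
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"


class QueryJob:
    """Hypothetical asynchronous query job; the work runs in a background thread."""

    def __init__(self, adql, work):
        self.adql = adql
        self._work = work  # callable that actually evaluates the query
        self.phase = Phase.PENDING
        self.results = None

    def run(self):
        """Asynchronous control: start the job and return immediately."""
        self.phase = Phase.QUEUED
        threading.Thread(target=self._execute, daemon=True).start()

    def _execute(self):
        self.phase = Phase.EXECUTING
        try:
            self.results = self._work(self.adql)  # arbitrary amount of time
            self.phase = Phase.COMPLETED          # set only after results exist
        except Exception:
            self.phase = Phase.ERROR

    def download(self):
        """Synchronous download: only valid once the work is complete."""
        if self.phase is not Phase.COMPLETED:
            raise RuntimeError(f"results not available in phase {self.phase.value}")
        return self.results


def slow_query(adql):
    """Stand-in for an expensive catalogue query."""
    time.sleep(0.1)
    return [("row", i) for i in range(3)]


job = QueryJob("SELECT ra, dec FROM catalogue WHERE mag < 20", slow_query)
job.run()
while job.phase in (Phase.QUEUED, Phase.EXECUTING):  # client polls the phase
    time.sleep(0.01)

rows = job.download()
print(job.phase.value, rows)
```

The client never holds a connection open while the query runs; it only polls a cheap status value and then reads results that already exist in full.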

In summary, the DAL model of services is well fitted to the special cases of image and spectrum servers where:

The DAL model is a poor fit to any other kind of service. It can be patched up to serve other cases, but it is fundamentally wrong for TAP and for general applications. For TAP, I suggest that we apply the UWS pattern directly to produce asynchronous controls for the query.

-- 
Kona Andrews        kea-at-roe.ac.uk
AstroGrid Project   http://www.astrogrid.org
IfA, Royal Observatory, Blackford Hill, Edinburgh EH9 3HJ
Received on 2007-05-01Z17:18:15