Re: Asynchronous querying and tabular data

From: Doug Tody <dtody-at-nrao.edu>
Date: Wed, 2 May 2007 08:35:39 -0600 (MDT)


On Tue, 1 May 2007, Patrick Dowler wrote:

> On Tuesday 01 May 2007 21:02, Doug Tody wrote:

>> The result, if a large query is attempted synchronously, is truncation
>> or an error response; alternatively, for serious large queries we
>> have a two-stage operation involving estimation and job submission.
>> This is basically what the queryData/stageData concept already provides.

>
> I am afraid you have lost me here. I see no reason to infer that
> queryData is some sort of estimate on the work required to do the
> real thing. In SIA it is a query and returns the query result. It
> happens that the query result itself describes something else and
> one column (hopefully) contains a URL to the something else. It is
> not an estimate.

In SIA/SSA etc., when used to access virtual data, queryData represents a contract between the client and server: each row of the output table specifies a data product which could be produced. A row can then be referenced back to the service, e.g., with stageData, to have it go off and do the computation.

Currently the query response only supports synchronous data access, so there is no indication of the computational cost or time-to-run of a job (although the output dataset size is estimated). However, as we add support for asynchronous operations to the DAL services, we just need to add this information for those services which support async operations as an optional capability. The query response can tell whether or not a dataset can be computed synchronously, and if not, estimate the size of the computation required to produce it. The client can then either repeat the query to refine the job specification (e.g., ask for something smaller), or stage the request.
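To make the client side of this concrete, here is a rough Python sketch of the decision a client could make for each row of the query response. The field names (sync_ok, est_cost_seconds, etc.) and the max_wait_seconds threshold are made up for illustration; none of them come from an existing DAL specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRow:
    """One row of a query response. All field names are hypothetical."""
    product_id: str
    est_size_bytes: int                        # output size estimate
    sync_ok: bool                              # can be computed synchronously?
    est_cost_seconds: Optional[float] = None   # only given by async-capable services


def choose_action(row: QueryRow, max_wait_seconds: float) -> str:
    """Return 'get', 'refine', or 'stage' for a single virtual data product."""
    if row.sync_ok:
        # Cheap enough for a normal synchronous getData.
        return "get"
    if row.est_cost_seconds is not None and row.est_cost_seconds > max_wait_seconds:
        # Too expensive for this client: repeat the query asking for less.
        return "refine"
    # Acceptable cost (or no estimate available): submit it as a staged job.
    return "stage"
```

A client would loop over the rows, fetch the "get" products directly, and collect the "stage" products into one staging request.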

The concept with stageData is that it references one or more of these virtual data products (tasks?), and initiates a single batch job to compute all of them. The job might compute only a single computationally intensive dataset, or it might compute thousands of smaller datasets in parallel. For each data product, the stageData request will also need to specify disposition, e.g., whether the data is to be staged locally or delivered to a remote VOSpace. If data is staged locally, a streaming GET (normal synchronous getData) can be used for retrieval, even of very large datasets.
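As a sketch of what such a request might carry, the following Python models one batch job holding many data products, each with its own disposition. The class names, the "local"/"vospace" values, and the example vos:// URI are assumptions for illustration only, not a proposed interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StageItem:
    product_id: str                    # refers back to a row of the query response
    disposition: str = "local"         # hypothetical values: "local" or "vospace"
    destination: Optional[str] = None  # required only for remote delivery


@dataclass
class StageRequest:
    """A single batch job covering one or many virtual data products."""
    items: List[StageItem] = field(default_factory=list)

    def add(self, product_id: str, disposition: str = "local",
            destination: Optional[str] = None) -> "StageRequest":
        if disposition not in ("local", "vospace"):
            raise ValueError(f"unknown disposition: {disposition}")
        if disposition == "vospace" and not destination:
            raise ValueError("remote delivery needs a destination")
        self.items.append(StageItem(product_id, disposition, destination))
        return self

    def retrieval_plan(self) -> Dict[str, str]:
        """Locally staged products are fetched later with a synchronous getData."""
        return {i.product_id: ("getData" if i.disposition == "local" else "delivered")
                for i in self.items}
```

For example, StageRequest().add("p1").add("p2", "vospace", "vos://example/out/p2") describes one job in which p1 is staged locally and p2 is pushed to a remote space.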

> TAP queries may contain a column with a URL to something, but the
> standard case is that the query result is something in its own right
> and not generally the first of two stages of work. In this light,
> I think it is a perfectly reasonable interpretation of typical DAL
> style to say that queryData is a synchronous method that returns a
> query result.

Right; in a simple TAP query against a data table, the operation should probably be synchronous and return the table data directly (this will be enough for many queries). If this mode is used for a large query, about all we can do is truncate the result or return an error. For such queries there is probably no alternative to a two-step process of estimation followed by a staging request.
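The two synchronous outcomes can be sketched as follows. This is only an illustration of the behaviour described above: the max_rows limit, the on_overflow switch, and the exception name are hypothetical and not taken from any TAP draft.

```python
from typing import List, Sequence, Tuple


class QueryTooLarge(Exception):
    """Raised when a synchronous query would exceed the service limit."""


def run_sync_query(rows: Sequence[dict], max_rows: int,
                   on_overflow: str = "truncate") -> Tuple[List[dict], bool]:
    """Return (rows, truncated). Large results are truncated or rejected."""
    if len(rows) <= max_rows:
        return list(rows), False
    if on_overflow == "truncate":
        # Return what fits and flag the response as incomplete.
        return list(rows[:max_rows]), True
    # Otherwise refuse, so the client falls back to estimation plus staging.
    raise QueryTooLarge(
        f"{len(rows)} rows exceeds the synchronous limit of {max_rows}")
```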

Received on 2007-05-02Z16:46:14