Re: TAP and large resultsets

From: Doug Tody <dtody-at-nrao.edu>
Date: Sun, 28 Jan 2007 22:15:13 -0700 (MST)


Hi Kona, All -

I agree that a fully streamed query could be a powerful way to deal with large queries, and we should consider supporting this. However, a fully streamed query is not fully general: it cannot handle ORDER BY or anything else that requires the server to manage the full result set. It is also semantically complex for the client to deal with potentially very large query responses. Another key point is that we use paged queries in part to make things more responsive over a slow Internet connection. Often the client will abort the whole operation after receiving the first page or so of the result set, and repeat the operation with different parameters.
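
To make the trade-off concrete, here is a minimal sketch of streaming versus paging without a cache. It uses Python's sqlite3 module as a stand-in DBMS; the function names, the `src` table, and the chunk size are illustrative assumptions, not anything defined by TAP or taken from the DSA code. The streamed reader holds only one chunk in memory, while the uncached pager has to re-run the query and skip rows for every page requested.

```python
import sqlite3
from itertools import islice


def stream_rows(conn, sql, chunk_size=1000):
    """Yield rows one at a time, fetching at most chunk_size rows per round trip."""
    cur = conn.execute(sql)
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        yield from rows


def page_by_rerun(conn, sql, page, page_size):
    """Paging with no server-side cache: re-run the query and skip to the page.

    The query needs a deterministic order for pages to be consistent, and
    any sorting is done by the DBMS, which may need the full result set.
    """
    start = page * page_size
    return list(islice(stream_rows(conn, sql), start, start + page_size))


# Hypothetical demo data: 100 sources with ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, mag REAL)")
conn.executemany(
    "INSERT INTO src (id, mag) VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 101)],
)

query = "SELECT id, mag FROM src ORDER BY id"
first_page = page_by_rerun(conn, query, page=0, page_size=10)
third_page = page_by_rerun(conn, query, page=2, page_size=10)
streamed_count = sum(1 for _ in stream_rows(conn, "SELECT id FROM src", chunk_size=7))
```

Each call to `page_by_rerun` executes the full query again, so the cost of fetching page N grows with N. A client that aborts after the first page pays very little.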

For very large queries we need advanced techniques such as use of asynchronous operations and VOStore, or a streaming query. For "modest" size queries one might define an upper limit for the size of the result set that the server will manage without resorting to the more complex managed techniques, plus some options for how the client can get at the result set. This could include either making the upper limit small enough to return the result in one go in interactive times (the most basic interface, a la cone search), or some scheme based on automated server-side caching. If the result set is cached on the server, then it can be returned either via paging or via a streaming transfer (as we already do for other large datasets such as images).
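
One simple way to enforce such an upper limit is to fetch one row more than the limit, which tells the server whether the result overflowed without counting the full set. The following is a sketch only, again using sqlite3 as a stand-in; `query_with_limit`, the `obs` table, and the limits shown are hypothetical.

```python
import sqlite3


def query_with_limit(conn, sql, max_rows):
    """Return (rows, overflow): at most max_rows rows, plus a truncation flag."""
    cur = conn.execute(sql)
    # Fetch one extra row so overflow is detected without reading everything.
    rows = cur.fetchmany(max_rows + 1)
    overflow = len(rows) > max_rows
    return rows[:max_rows], overflow


# Hypothetical demo data: 50 observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO obs (id) VALUES (?)", [(i,) for i in range(50)])

small, small_overflow = query_with_limit(conn, "SELECT id FROM obs ORDER BY id", 20)
full, full_overflow = query_with_limit(conn, "SELECT id FROM obs ORDER BY id", 100)
```

When the overflow flag is set, the service could tell the client that the result was truncated and point it at the paged or asynchronous interface instead.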

So long as the server does not have to manage writeable storage on behalf of a client, caching result sets on the server is not necessarily very complicated. TAP already assumes that a DBMS is involved, so it is not so difficult to store a result set in a temporary table that the server manages transparently and deletes after some interval.
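
A rough sketch of the temporary-table idea follows, with sqlite3 standing in for the service's DBMS. The `ResultCache` class, its method names, the token format, and the expiry interval are all assumptions made for illustration; a real service would also need to validate the incoming query rather than interpolate it into SQL as is done here.

```python
import sqlite3
import time
import uuid


class ResultCache:
    """Cache query results in temporary tables and serve them page by page."""

    def __init__(self, conn, ttl_seconds=3600):
        self.conn = conn
        self.ttl = ttl_seconds
        self.expiry = {}  # temp table name -> expiry timestamp

    def run(self, select_sql, now=None):
        """Run the query once, store the result, and return a token for paging."""
        now = time.time() if now is None else now
        name = "result_" + uuid.uuid4().hex
        # select_sql is assumed to be trusted in this sketch.
        self.conn.execute(f"CREATE TEMP TABLE {name} AS {select_sql}")
        self.expiry[name] = now + self.ttl
        return name

    def page(self, name, page, page_size):
        """Return one page of a cached result, in the stored row order."""
        if name not in self.expiry:
            raise KeyError("unknown or expired result set")
        cur = self.conn.execute(
            f"SELECT * FROM {name} ORDER BY rowid LIMIT ? OFFSET ?",
            (page_size, page * page_size),
        )
        return cur.fetchall()

    def purge(self, now=None):
        """Drop cached results whose expiry time has passed; return how many."""
        now = time.time() if now is None else now
        expired = [n for n, t in self.expiry.items() if t <= now]
        for n in expired:
            self.conn.execute(f"DROP TABLE {n}")
            del self.expiry[n]
        return len(expired)


# Hypothetical demo data: 100 sources with ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO src (id) VALUES (?)", [(i,) for i in range(1, 101)])

cache = ResultCache(conn, ttl_seconds=60)
token = cache.run("SELECT id FROM src WHERE id > 40 ORDER BY id DESC", now=0)
page0 = cache.page(token, page=0, page_size=25)
page2 = cache.page(token, page=2, page_size=25)
```

Here the query runs once, including the ORDER BY, and each page request is a cheap read of the stored table. A periodic call to `purge` keeps the cache from growing without bound.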

I agree that for the simplest possible service we probably do not have to require that it support paged queries. However, it is important to have a simple way to deal with queries up to the point where we get into grid techniques, while still providing reasonable interactive performance.

On Tue, 9 Jan 2007, Kona Andrews wrote:

> Greetings colleagues,
>
> Happy New Year and I hope you have had a restful and productive holiday.
> Mine was productive of two colds and a mild case of flu, which just goes
> to show what a bad idea it is ever to stop working ;-)
>
> Prior to our telecon next week, I wanted to raise a point about the
> TAP protocol, in particular about having a paged interface for large
> queries (so the user can bring back the query results in small ordered
> chunks), as we briefly discussed last time.
>
> First, some background.
>
> In AstroGrid, we have a deployment-oriented remit whereby part of our
> goal is to get our software deployed in "third-party" institutions (i.e.
> by people outside our own team/locations), and ideally in *all* UK
> institutions. Two things that deployers emphasise as critical to
> them are:
>
> 1. Components should have a low installation/maintenance cost in
> human time (and I acknowledge we still have much work to do here!)
>
> 2. Components should have a low resource requirement
>
> (In other words, "we'll deploy it as long as we don't have to do very
> much and it doesn't require any additional hardware; we have no time
> and no money." Etc. Fair enough.)
>
> In the case of the AstroGrid DataSet Access (DSA) component, the
> architecture was very carefully designed to be fully streaming (partly
> to reduce resource requirement, and partly to ensure an architecture that
> scaled to the very large queries envisaged as a normal event in the VO).
> In other words, in the course of processing a query, the query results
> never need to be cached in memory or on disk. This means, for
> example, that a DSA running in a Tomcat with (e.g.) 64 MB of memory
> and no additional "scratch disk" resources can successfully return
> multi-*gigabyte* query results files to VOSpace, if requested.
>
> This fully-streamed approach has additional benefits, in that the
> component is not vulnerable to the filling-up of disk caches and there
> is no disk-maintenance load (flushing old files, managing quotas etc).
> However, the streamed approach has implications for offering results
> paging as a part of TAP - namely that, since the results are not cached
> anywhere, each time a page is requested in a TAP query, the full query
> must be (re-)run and only the relevant subset of results returned to the
> user.
>
> While inefficient, this is obviously not impossible to implement, and
> we can certainly implement paging as part of our TAP support. However,
> I am strongly opposed to making the paged interface *compulsory*.
>
> Our observation with "real deployers" of AG software has been that, if
> an AstroGrid component starts to hammer too heavily/obviously on an
> institution's resources, then the institution responds by wanting to
> disable it (perhaps I should have added a point 3 above: "Give us
> any trouble and you're outta here..."). For example, some AstroGrid
> deployers have specifically disabled the conesearch interface on their
> DSAs until conesearch efficiency improvements are in place [mea culpa].
>
> If paged support in TAP is *optional*, then we can provide a mechanism to
> selectively disable it. Then, if an institution finds that paged
> querying is clogging up the database because of the repetition of
> intensive queries, they can switch the *paging function* of TAP
> off (or limit/throttle it in some way), but still support e.g. simpler
> unpaged queries. However, if paging is compulsory in TAP, then they may
> just switch the whole TAP interface off - or maybe the whole component -
> to the greater detriment of the users who then can't run queries at all.
>
> I realise that it may seem that I'm driving the interface protocol spec
> based on a particular implementation (our streamed DSA component).
> However, I do honestly believe that a streamed architecture for querying
> is the only sensible choice for scalability (handling arbitrarily large
> results and arbitrarily large numbers of simultaneous queries); anything
> based on disk caching is always going to hit the limits of the available
> disk cache at some point - sooner rather than later if deployers are
> stingy with resources.
>
> All the best,
> Kona
>
Received on 2007-01-29Z06:15:46