Re: SSA working draft

From: Doug Tody <dtody-at-nrao.edu>
Date: Tue, 21 Nov 2006 20:42:42 -0700 (MST)


Hi Alberto -

> About 1.1 Architecture
> ----------------------
>
> I do share Inga's position on the fact that SSA should not force
> -and I'm sure that is not the intent- a data provider to only
> answer to a SSA query with an on-the-fly generated, so called "virtual"
> dataset. It is completely up to the data provider to come up with
> a best schema to comply to the VO, and that could very well be
> a "real" static and fully compliant dataset.
> That second sentence in 1.1 could be removed from the document without
> causing any damage, isn't it?

It is true that for "archival" data (no subsetting or filtering), the data returned does not necessarily have to be generated on the fly; it could be precomputed and cached. We can revise the text to be more precise in describing the role of virtual data in the interface.

Nonetheless this static file business is a rather limited approach to things, and many aspects of SSA require on-demand generation of the data products. In addition to cases like cutouts or spectral extraction, a good service should be able to return data in any of several formats, and should be designed to be easily updated when a new version of the interface is released. In general when a new version of SSA is deployed, one will want to keep at least one old version around for a while, and this could require duplication of both the service and the entire data collection if a static file approach is taken. As you say, it is up to the data provider how to manage all this, but all these cases could be much easier to manage if these rather small datasets are generated on the fly.

> 1.3.2 Parameters
> ----------------
>
> "If the same parameter appears multiple times in a request
> the operation is undefined."
>
> This basically excludes the ability to provide a logical "OR"
> which is normally implemented using the "multiple choices" mechanism,
> (as in a "multiple select" or in the "check buttons" web form elements).
>
> Isn't that a pity? How would SSAP support a logical OR in a query?
> Do we have to wait until ADQL is in place to see that implemented?

To some extent the "or" mechanism can be provided by a list-structured parameter, which defines a set of acceptable values. The basic "and" mechanism is already provided by having a set of parameter constraints. This satisfies most simple queries. If it gets much more complicated then an expression based interface (ADQL) is the way to go.

That said, we can do anything we want with multiple instances of a parameter so long as it is well defined what to do in this case. What generally happens currently (where this is undefined) is that the service either returns an error, or it silently overrides the parameter value when a new value is specified, in effect providing a mechanism to override a default value. Both of these actions are as valid as defining multiple values to imply an "or". Another possibility would be to have the semantics be defined on a per-parameter basis.

I don't have any strong opinion on this one, so long as the semantics are logical and well defined. A multiple instance mechanism which translates into an equivalent list-structured value is possible for example. (At this point, this is another semantic detail which we carried over from OpenGIS/WMS).
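As a rough illustration of that last option, the sketch below folds repeated instances of a parameter into one list-structured value. It is only a sketch under stated assumptions: the function name `merge_repeated_params` is made up, the comma as list separator and the case-insensitive parameter names are assumptions, and `FORMAT` is used purely as an illustrative parameter name.

```python
from urllib.parse import parse_qsl


def merge_repeated_params(query_string):
    """Fold repeated parameters into a single comma-separated list value.

    Hypothetical helper: the comma separator and the upper-casing of
    parameter names are assumptions made for this illustration.
    """
    merged = {}
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        key = name.upper()  # assumption: parameter names are case-insensitive
        if key in merged:
            # A second instance extends the list instead of overriding it.
            merged[key] = merged[key] + "," + value
        else:
            merged[key] = value
    return merged


print(merge_repeated_params("FORMAT=votable&FORMAT=fits&BAND=1E-7/3E-6"))
```

With this rule a repeated parameter and an explicit list-structured value would mean the same thing to the service, so the "or" semantics would only need to be defined once.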

> CreationType
> ------------
>
> The end user wants to know what kind of processing was applied
> to the data; hence the user should be told if the data were binned
> or mosaic'd, etc.
> What is not clear to me is why the SSA service should describe only
> the part of the processing it is responsible for, as the word "Typically"
> indicates in the second sentence of 2.4.2, and as indicated in the very last
> sentence in 2.4.2 (which actually contradicts that initial sentence
> by forcing the creationtype to express ONLY operations happening
> during the VO access).
>
> Wouldn't it be better to describe the entire end-to-end process that brought
> the data in the status they are when they reach the user's disk?
> Otherwise, what is the value of such information?
>
> Unless the intention is to notify the VO user that the same data
> *in different form* exist somewhere else, in case s/he is not happy
> with it. If that is the case, then I would suggest a simpler "original"
> as opposed to "reprocessed" keyword, and forget all the quite artificial
> distinctions.

This is one of the more difficult points of SSA (as is the next one below), and I am not yet certain what the best approach is either.

One point here is that often the user does know something about the original data product, and may want to know what the service has done to produce the data product which is actually delivered. A use-case I had in mind here was access to complex data, e.g., a spectral data cube. It is useful to know if a spectrum was produced by on-demand extraction from a spectral data cube, as opposed to, for example, returning an entire dataset from some well-known spectral data collection (the "archival" case). In this case we have one well defined "original" data product (the survey cube) and we can view it in multiple ways, via 2-D or 3-D cutouts, via reprojection or a general slice specified in 2-D, via filtering by spectral bandpass, via extraction of a 1-D spectrum, and so forth. A good scheme which describes the creation of data from a source data product can deal with all these cases (this is more general than just SSA but that is the point here as SIA V2 is next up).

Another important case is where we have a well defined data collection which has already been carefully processed - the usual survey or instrument data collection for example - and the service generates a virtual data product from this by either cutting out a subset, or for example, reprojecting the data onto a standard coordinate system (changing the spectral dispersion in this case). Which was done is quite important to know: do we have the original pixels/samples painstakingly generated by the well-known survey data collection, or is the service filtering or interpolating, and thus degrading, the data samples, to better represent what we asked for? (SIA V1 already addresses this in a rudimentary fashion by the way).

On the other hand, I agree that in the most general case where the original data (as defined by the DataID metadata in SSA) is not well known, or we are doing a large scale automated analysis where knowledge of well known data collections cannot easily be used, what one wants to know is something about the overall processing done to get to the data actually returned by the service. Of course, this can get quite complex to describe, and if it gets too complex, it won't happen and we fail. We can hope to describe what the service does, but we aren't able yet to describe all the prior processing done as well.

I don't have a perfect solution to this problem yet either. The scheme proposed is more or less adequate to describe data access operations upon well defined data collections, hence may be a good starting point; however, I agree that we have not yet fully addressed this problem.

> 3.3.1 Input Parameters
> ----------------------
>
> I think the following two sentences contradict each other, or are
> at least confusing to the reader (me!).
>
> Early in the text:
>
> A. "if a given parameter is not specified or is not supported by the service,
> a logical value of "all" is generally assumed."
>
> At the bottom of 3.3.1:
>
> B. "where a specific value is specified for an attribute which is undefined
> for a given data collection, the service should respond by finding
> no matching data."
>
> Apart from the contradiction, I like B, and do not like A.
> Returning too many results is much worse than telling the user:
> please refine your query because our service does not support
> the input parameter you used.
> Also, "A." covers two very different cases:
> A1. a parameter is not specified
> A2. a parameter is not supported
>
> In the A1 case, I would agree that "all" is generally assumed.
> In the A2 case, the service should better bail out a warning to
> the user.

I agree this is a pretty subtle point, but I don't think this is a contradiction. The key point is that in case B,

  1. the parameter is explicitly specified,
  2. the parameter is supported by the service,
  3. the value is *known to be undefined* for the data.

Hence for theory data (for example), where time of observation is undefined, a query with an explicitly specified time of observation should find nothing - the time value is "known" for this data (it is known to be undefined) and does not match the query (except in the case that the theory data simulates a given actual observation time or epoch).

This is different than the case where query by time is merely not supported by the service: in this case the service does not know the time or does not support query by time and hence merely cannot apply the query constraint. Hence it matches data ignoring the constraint, leaving it to further processing upstream to resolve the matter, possibly by rejecting, refining and resubmitting the query.

The problem with a service aborting when it is given a query constraint it does not support is that an essential design requirement for DAL queries, needed to support global multiband data discovery and access, is that we can pose the *same* query to multiple services and expect it to work. Further query refinement, if required, can occur on the client side, where much greater knowledge of the problem to be solved is available and further examination of the metadata returned is possible. The alternative would require that the service metadata for every service be examined, and the capabilities of each service be understood well enough by the client to tailor the query for each service, which is unworkable.
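The three cases discussed above (parameter not specified, parameter not supported by the service, and value known to be undefined for the data) can be sketched as follows. This is a minimal illustration, not anything from the draft: the `UNDEFINED` sentinel, the `matches` function, the plain equality test, and the `TIME` and `TARGETNAME` names are all assumptions made for the example.

```python
UNDEFINED = object()  # sentinel: attribute is known to be undefined for the data


def matches(record, constraints, supported):
    """Decide whether one record satisfies the query constraints.

    record      -- dict of attribute name -> value (or UNDEFINED)
    constraints -- dict of parameter name -> requested value
    supported   -- set of parameter names the service can query on
    Equality matching is a simplification for this sketch.
    """
    for name, wanted in constraints.items():
        if name not in supported:
            # Not supported: the service cannot apply the constraint, so it
            # ignores it and leaves refinement to the client.
            continue
        actual = record.get(name, UNDEFINED)
        if actual is UNDEFINED:
            # Supported and explicitly specified, but known to be undefined
            # for this data: no match.
            return False
        if actual != wanted:
            return False
    # A parameter that was never specified imposes no constraint ("all").
    return True


theory = {"TARGETNAME": "model-1", "TIME": UNDEFINED}
query = {"TIME": "2006-11-21"}

print(matches(theory, query, supported={"TIME"}))  # service supports TIME
print(matches(theory, query, supported=set()))     # service ignores TIME
print(matches(theory, {}, supported={"TIME"}))     # nothing specified
```

The same query can be sent unchanged to both kinds of service: one rejects the theory record, the other returns it and leaves the filtering to the client.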

> SPECRES
> ----------
>
> I think a SPEC_RP is the new suggested keyword for a lambda/d(lambda),
> which is called resolving power, and not resolution.
>
> Maybe a way out is to let the data provider to choose whether
> a FWHM (SPECRES) or a L/dL (SPEC_RP) suits better her data?

Resolving power is the more correct term here, although I think spectral resolution in RP units is also commonly used. Anyway, you are right, we should probably call it the spectral resolving power.

As in other cases we always want to simplify the interface rather than add more features, so having two ways to specify essentially the same thing is probably not justified.

> Units
> ---------
>
> In various tables (e.g. 3.3.3) the unit DDEG is mentioned,
> to mean decimal degrees. I do not think that is an agreed standard.
> To avoid troubles, I suggest you change that into "deg".

Ok; guess we should not confuse units and format.

> Inconsistency about fully compliant services
> ---------------------------------------------
> 3.3.3 initial sentence states that "all" the "should or may" parameters
> are required for a fully compliant service. I think that is wrong
> and does not match with 1.4.1.
> Only the "should" parameters are required for a fully compliant service,
> isn't it?

Yes; this is stated incorrectly (fully compliant probably included the optional parameters in an early version but that is now thought to be too stringent).

> Ranges
> ----------
>
> 1. Why not allowing ranges in all (at least numerical) fields?

While in theory this might be nice and consistent, it may not make sense for all parameters, and supporting a range list complicates the interface (e.g., for the range list in BAND we currently have this as an optional capability).

Supporting a single value where a range is permitted can also be useful in its own right, e.g., to specify a point within a bandpass, rather than an explicit bandpass ("give me anything which contains this value" vs "give me anything where this range intersects the actual data range").
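The difference between the two readings can be made concrete with a small sketch. It assumes that a range is written in min/max order with "/" as the separator and that an empty side means open-ended; the draft does not yet state the order explicitly, and the function names `parse_range` and `overlaps` are hypothetical.

```python
def parse_range(text):
    """Parse 'value', 'lo/hi', '/hi' or 'lo/' into a (lo, hi) pair.

    None marks an open end. Assumes min/max order, which is an assumption
    of this sketch rather than a rule taken from the draft.
    """
    if "/" not in text:
        value = float(text)
        return (value, value)  # a single value is a zero-width range
    lo_text, hi_text = text.split("/", 1)
    lo = float(lo_text) if lo_text else None
    hi = float(hi_text) if hi_text else None
    return (lo, hi)


def overlaps(query, data_lo, data_hi):
    """True if the query value or range intersects [data_lo, data_hi]."""
    lo, hi = parse_range(query)
    if lo is not None and lo > data_hi:
        return False
    if hi is not None and hi < data_lo:
        return False
    return True


# Dataset covering 4e-7 to 7e-7 (units left unspecified here).
print(overlaps("5e-7", 4e-7, 7e-7))       # point inside the bandpass
print(overlaps("6e-7/9e-7", 4e-7, 7e-7))  # range intersecting the data range
print(overlaps("/3e-7", 4e-7, 7e-7))      # open-ended range below the data
```

Treating a single value as a zero-width range lets one intersection test cover both "contains this value" and "this range intersects the data range".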

> 2. Apparently there is no mandatory order when specifying a range;
> in many examples throughout the entire document one can find
> both:
>
> 1E-6/3E-7 (that is, max/min)
>
> or
>
> 1.3/3.0 (that is, min/max)
>
>
> But when an open-ended range comes along (e.g. /5) that implies
> a very specific order: <= 5 (ie min/max).
>
> Minor point, but if one uses all the time the max/min order,
> s/he could end up getting too used to that, and use /5 to indicate >= 5.

Good point. This issue needs to be clarified, and has come up before; we should have gotten it into this version of the document.

> 3.3.3.15 COMPRESS
> -----------------
>
> Unclear: Is that paragraph saying that
> even if a client asks for compression, the server could return an
> uncompressed file?

Yes. The client just says it is prepared to accept compressed data, and please use compression if it is worthwhile; whether a given dataset is compressed is up to the service.
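A service-side decision along these lines might look like the sketch below. The use of gzip, the "smaller than the original" test for "worthwhile", and the function name `encode_response` are all assumptions for illustration; the draft leaves the choice to the service.

```python
import gzip


def encode_response(payload, client_accepts_compression):
    """Return (body, compressed_flag) for a dataset to be sent to the client.

    Hypothetical policy: compress only if the client allows it and the
    compressed form is actually smaller than the original.
    """
    if client_accepts_compression:
        packed = gzip.compress(payload)
        if len(packed) < len(payload):
            return packed, True
    # Either the client did not ask, or compression was not worthwhile.
    return payload, False


body, compressed = encode_response(b"0.0 " * 5000, True)
print(compressed, len(body))
```

Even when the client sets the flag, a tiny or already compact dataset comes back uncompressed, which matches the reading that the parameter is a permission rather than a demand.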

Received on 2006-11-22Z04:43:45