A symbolic description of VOOQL

From: Ed Shaya <Edward.J.Shaya.1-at-gsfc.nasa.gov>
Date: Thu, 07 Aug 2003 18:30:07 -0400


We have been asked to work on another topic for the next few months, so we wanted to summarize
some development we had been working on before it is forgotten. We were headed towards showing that the software for object query could in fact be independent
of the scientific field. Only the names of object properties, aggregates, and classes, and a set of
transforming functions between properties for a given field, are needed to make a single package of software and a single query language run for any given field. It is not clear whether we got there, but we were certainly getting close.

Ed

VOOQL (Virtual Observatory Object Query Language): a formal description of a generic query system for distributed data about the properties of objects. For the objects and properties here, I have in mind astronomical objects and their physical properties right now, but everything here can hold for data objects (say, tables with the properties mostly referring to the various columns) or for medical illnesses and their statistical properties, etc. This is truly a metadata independent system. This symbolic description should help us to discuss query issues and user scenarios, describe use cases, and begin work on implementation issues. VOQL could use VOOQL as a basis for astronomical object and data object query.

A general object query contains a logical function, L (and, or, etc.), of a boolean grand property P and a boolean grand function F: Q = L(P,F)

The grand property may be a logical function of other grand properties, or it may be the return status of an atomic property array (p_i), or it may be the return status of an atomic aggregate property array (a_i): P = L(P,P) or P = return_stat(p_i) or P = return_stat(a_i). The return status is 0 if no object satisfies the constraints and 1 if one or more do.

An atomic property (a property not derived from other properties) has arguments: an input object array (O_j), a property type (pType), a set of modifiers (M) consistent with the pType, a range of values allowed for the pType, and an output object array (Ovar_i) which lists all of the objects with pType in range. It returns a status and an array variable holding, for each responding resource (r), the values of the pType for the array of objects Ovar_i.

p_i = p(O_j,pType,M,range_j,Ovar_i) = var_i^r

The response can be an array of rational numbers, integers, or strings. On query, the range_j (rational or integer) can be expressed as [R_j1:R_j2] (although usually it is a constant range for each member of the array, [R1:R2]) or, for strings, it is a regular expression.
The r superscript on var refers to another dimension of the array, for results from the various resources (data centers).

For range >R, we have [R:]. For range <R, we have [:R] and for any rational value, we have [:]. For strings, any string is '*'.
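The range notation above can be sketched in Python as follows. This is only an illustration: the helper names (parse_range, in_range, string_matches) are hypothetical, and treating both ends of a range as inclusive is an assumption that the description does not settle.

```python
import re
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]

def parse_range(text: str) -> Bounds:
    """Parse '[R1:R2]', '[R:]', '[:R]' or '[:]' into (low, high).

    None stands for an open (unbounded) end.
    """
    match = re.fullmatch(r"\[\s*([^:\]]*)\s*:\s*([^:\]]*)\s*\]", text.strip())
    if match is None:
        raise ValueError(f"not a range expression: {text!r}")
    low, high = (part.strip() for part in match.groups())
    return (float(low) if low else None, float(high) if high else None)

def in_range(value: float, bounds: Bounds) -> bool:
    """Test a numeric value against parsed bounds (inclusive ends: an assumption)."""
    low, high = bounds
    return (low is None or value >= low) and (high is None or value <= high)

def string_matches(value: str, pattern: str) -> bool:
    """'*' accepts any string; anything else is treated as a regular expression."""
    return pattern == "*" or re.fullmatch(pattern, value) is not None
```

For instance, parse_range("[2:]") gives (2.0, None), and in_range(1.0, parse_range("[2:]")) is False.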

The object array is a set of objects specified either by Ids or a class or the result of some property or aggregate constraint. O_i = O(Id_i) or O(Class).
If the input list is the whole universe of objects then O_i = U.

It is expected that, as results accumulate from various parts of the overall query, the var_i^r and Ovar_i are joined into a results table object that has the unique union of all O_i's as lines and r times N(p) columns.
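A minimal sketch of such a results table, assuming each response arrives as a (resource, pType, values) triple. The function name, the response layout, and the sample object and resource names are all made up for illustration.

```python
from collections import defaultdict

def join_results(responses):
    """Join per-resource responses into one table keyed by object.

    responses: iterable of (resource, ptype, {object_id: value}).
    Returns {object_id: {(resource, ptype): value}}, so the rows are the
    unique union of objects and the columns are (resource, property) pairs.
    """
    table = defaultdict(dict)
    for resource, ptype, values in responses:
        for object_id, value in values.items():
            table[object_id][(resource, ptype)] = value
    return dict(table)

# Hypothetical responses from two resources.
responses = [
    ("resource_A", "distance", {"obj1": 10.0, "obj2": 25.0}),
    ("resource_B", "distance", {"obj2": 27.0}),
    ("resource_A", "apparent_magnitude", {"obj1": 5.2}),
]
table = join_results(responses)
```

Cells for which a resource gave no answer are simply absent here; a real table object would need some explicit representation of missing values.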

An aggregate array has arguments: an object array, an aggregate type (aType), and a set of modifiers. The input and output of an aggregate array are object arrays. The input array sets the parent objects and the output array is a complete set of k aType child elements of O_j consistent with M.

a_i = a(O_j,aType,M) = Ovar_ik^r

The r superscript again refers to the differing results from different resources. The user will need to apply some algorithm to resolve these conflicting results.
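As a rough sketch of the aggregate query a(O_j,aType,M), assume each resource is an in-memory mapping from (parent, aType) to a list of children, and that the modifiers M can be modelled as an optional predicate on a child. Both are simplifying assumptions, and all names below are hypothetical.

```python
def aggregate(parents, atype, resources, modifier=None):
    """Return (status, {resource: {parent: [children]}}).

    status is 1 if any resource reports at least one child, else 0.
    The results are kept separate per resource, since resources may disagree.
    """
    result = {}
    for name, catalog in resources.items():
        per_parent = {}
        for parent in parents:
            children = catalog.get((parent, atype), [])
            if modifier is not None:
                children = [child for child in children if modifier(child)]
            if children:
                per_parent[parent] = children
        result[name] = per_parent
    status = int(any(per_parent for per_parent in result.values()))
    return status, result

# Two hypothetical resources that disagree about the members of cluster1.
resources = {
    "resource_A": {("cluster1", "galaxy"): ["gal1", "gal2"]},
    "resource_B": {("cluster1", "galaxy"): ["gal1", "gal3"]},
}
status, children = aggregate(["cluster1"], "galaxy", resources)
```

Here the two resources return different child lists for the same parent, which is exactly the conflict the user would have to resolve.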

The grand function F is a logical function of two grand functions or the boolean return status of a function of a vector of properties: F = L(F,F) or return(f(p_vector,args)). These functions can be services, predefined, defined in the query, or simply arithmetic done on the fly.

Some properties may be derived by one or more functions of other properties. p_i = f1(p_vector1, args1) = f2(p_vector2, args2) etc. One point to make is that we have now permitted arbitrary math into every part of this language. Another issue is that a deep search for properties should take advantage of a reasonable set of these derivations and make requests both for p_i directly and for all of the properties that make up the vectors needed by the functions. The user will need to be consulted as to which of these functions are desirable or worth the extra time and effort. Nevertheless, it is worthwhile including a fairly complete table of property derivation functions in the query system to expand the set of properties that are queryable beyond the set of properties in the actual data.

For example, absolute magnitude is not an observable; it is derived from apparent magnitude and distance, or from the period of variability in Cepheid stars. So one can look into various resources for absolute magnitude, distance, apparent magnitude, and period of variability. However, if a particular resource has absolute magnitudes, it is probably not useful to search the same resource for the others, since the property has probably already been derived from them.
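A small sketch of a derivation table for this example. It uses the standard distance modulus relation, M = m - 5 log10(d / 10 pc), and ignores extinction. The table layout and the names DERIVATIONS and resolve are hypothetical, and the Cepheid period route is left out because it needs calibration coefficients that are not given here.

```python
import math

def absolute_magnitude(apparent_magnitude, distance_pc):
    """Distance modulus relation, ignoring extinction (distance in parsecs)."""
    return apparent_magnitude - 5.0 * math.log10(distance_pc / 10.0)

# Hypothetical derivation table: property -> list of (input names, function).
DERIVATIONS = {
    "absolute_magnitude": [
        (("apparent_magnitude", "distance_pc"), absolute_magnitude),
    ],
}

def resolve(ptype, known):
    """Return the value of ptype, directly if known, otherwise by derivation."""
    if ptype in known:
        # The resource already holds the property, so no derivation is needed.
        return known[ptype]
    for inputs, func in DERIVATIONS.get(ptype, []):
        if all(name in known for name in inputs):
            return func(*(known[name] for name in inputs))
    return None  # Neither stored nor derivable from what is known.

derived = resolve("absolute_magnitude",
                  {"apparent_magnitude": 5.0, "distance_pc": 100.0})
```

Checking for a stored value first mirrors the remark above: when a resource already has the absolute magnitude, its inputs are not searched.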

The evaluation of the logical functions L and the return statuses requires evaluating whether var_i^r is in range for each i. This requires collapsing var_i^r (one value per responding data center) into var_i. This can be done by selecting an averaging function, or an extrapolation function to a given time, or some other way of choosing a best value. We discussed in recent e-mails that if one data center responds that p(O) is in range, then the other data centers need to be queried for their value of p(O) to obtain an unbiased best value. Thus, even if one does not require the values and is just constraining the object list, it is still necessary to request var_i^r and to specify a function that collapses the values and evaluates each object for constraint satisfaction. The expression is: constraint = collapse(p_i) = collapse(p(O_j,pType,M,range_j,Ovar_i)), and only the status is returned.
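The collapse step might look like the following sketch, which assumes a plain mean as the collapsing function and inclusive range ends. The function name and the data layout are hypothetical, and any other best-value estimator could be passed in instead.

```python
from statistics import mean

def constraint_status(values_by_object, bounds, collapse=mean):
    """Collapse per-resource values, then test each object against the range.

    values_by_object: {object_id: [one value per responding resource]}.
    bounds: (low, high), with None for an open end.
    Returns (status, selected), where status is 1 if any object passes.
    """
    low, high = bounds
    selected = []
    for object_id, values in values_by_object.items():
        if not values:
            continue  # No resource answered for this object.
        best = collapse(values)
        if (low is None or best >= low) and (high is None or best <= high):
            selected.append(object_id)
    return int(bool(selected)), selected

# Hypothetical values from two data centers, with the constraint [:15].
values = {"obj1": [9.0, 11.0], "obj2": [14.0, 30.0]}
status, selected = constraint_status(values, (None, 15.0))
```

In this example one data center reports obj2 as in range (14.0), but the collapsed value (22.0) is not, so obj2 is dropped. That is why every data center has to be asked before the constraint is decided.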

Since evaluation of the query cannot proceed until all data centers have responded about candidate objects that satisfy constraints, only fairly simple atomic queries need to be sent to individual data centers. The complex evaluation must be done by some integrating agent. The low level queries needed are the individual p_i and a_i; the grand properties P need not be distributed, nor do the functions.
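One way an integrating agent could split this work is sketched below. It assumes the grand property is held as a nested tuple tree whose leaves are the atomic p and a queries. The tree encoding and the function names are hypothetical, and the sketch works on the overall 0/1 return status only, not on per-object results.

```python
LOGICAL_OPS = ("and", "or", "not")

def atoms(query):
    """Collect the leaf atoms (the queries to distribute) from a query tree."""
    if query[0] in LOGICAL_OPS:
        found = []
        for child in query[1:]:
            found.extend(atoms(child))
        return found
    return [query]

def evaluate(query, status):
    """Evaluate the tree once every atom has a 0/1 return status."""
    op = query[0]
    if op == "and":
        return all(evaluate(child, status) for child in query[1:])
    if op == "or":
        return any(evaluate(child, status) for child in query[1:])
    if op == "not":
        return not evaluate(query[1], status)
    return bool(status[query])

# Hypothetical grand property: distance in [:15] and (magnitude in [:6] or has galaxy children).
query = ("and",
         ("p", "distance", "[:15]"),
         ("or", ("p", "apparent_magnitude", "[:6]"), ("a", "galaxy")))
leaves = atoms(query)
status = {leaves[0]: 1, leaves[1]: 0, leaves[2]: 1}
answer = evaluate(query, status)
```

Only the three leaves would be sent to the data centers; the logical combination stays with the agent.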

A basic premise here is that one needs to distribute O(p) queries to more than one resource and that one does not know exactly what form the information is in at the many resources. If for each property type there is a "best" resource and one just needs to look up the correct resource to use, then it is not a true distributed data system, it is a disjoint system and one need only do a virtual join to see it as a single resource. However, to create such a system, one has to bring together all of the data on each and every property for each and every object and collapse all data into a single value for each O(p). One accepts some authority's decision on how to properly average and evaluate all error bars, discard outliers, and make unbiased judgements on the quality of the investigators' works. One also has to trust that all best values are up to date and that all of the latest results everywhere in the world have been included in a global averaging scheme with all of the data of the past. Such a system runs counter to the benefits of a distributed system where all of the individual data measurements are available for fresh analysis. This is particularly important if there is any time variability, secular or periodic, in the object properties, because this is totally lost in the disjoint system.

Finally, here is a list of simple queries that would typically be distributed to various resources.
What is the value for property pType for object O? p(O,pType,M,[:],var) (i.e., the value is a rational number)
Is the value for property pType in range R1 to R2? p(O,pType,M,[R1:R2],var)
Is the value for property pType greater than R1? p(O,pType,M,[R1:],var)
Does object O have aggregate aType children? a(O,aType,M)
Does object O have aggregate aType children, and what are they? a(O,aType,M,Ovar_i) (i.e., the value is in the universe of Objects)

So the implementation trick for the high level language is to automatically break up the top level grand property into these simple atoms and then, when the results for these return, evaluate it and present the results.

Received on 2003-08-08Z00:30:04