RE: VOTable for simulations

From: Gerard <gerard.lemson-at-mpe.mpg.de>
Date: Thu, 31 Aug 2006 17:19:25 +0200


Dear Claudio
From the VOTable spec, in particular section 2.2, I gather that it already includes
support for multi-dimensional arrays. This seems the natural way to support at least
uniform grids coming from simulations as well. Some comments:

From their example I gather that arraysize="41x41x41x3" means "three data cubes of dimensions 41x41x41",
not "one 3D-vector valued datacube of dimensions 41x41x41". Likewise, "41x41x41" would mean "41 2D datafields of dimension 41x41". I think that therefore a 3D vector field
could, or has to, be encoded as (for example)

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
  <RESOURCE name="myVectorField">
    <TABLE name="VelocityField" ID="Vel">
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x"
             datatype="float" arraysize="41x41x41x1" unit="km/s" />
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y"
             datatype="float" arraysize="41x41x41x1" unit="km/s" />
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z"
             datatype="float" arraysize="41x41x41x1" unit="km/s" />
      <DATA>
        <BINARY> <STREAM href="file:///scratch/myhome/test.bin"/> </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

This makes the content of the individual field components more explicit. Each gets its own UCD, for example.
I have removed the rank attribute for the moment. There is no way yet to specify the spatial coordinates of the grid cells. For a grid one can generally specify
the spatial coordinates in a shorthand way, for example using a set of standard parameters
as in the FITS array keywords CRPIXn, CDELTn etc. (see
http://fits.gsfc.nasa.gov/standard21b/fits_standard.pdf 5.4.2.5). I think we need to specify something like that here as well; it is definitely more
efficient than having separate cubes with the coordinates. Luckily our coordinate systems will in general not require the full WCS-like formalism.
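
As a rough illustration of the shorthand mentioned above, the sketch below computes the cell coordinates along one axis from FITS-style linear keywords, using coord = CRVAL + CDELT * (pixel - CRPIX) with a 1-based pixel index. The function name `axis_coordinates` and the sample values (a 41-cell axis with 2.5 Mpc spacing) are assumptions made for this example only; nothing like this has been agreed for VOTable.

```python
def axis_coordinates(naxis, crpix, crval, cdelt):
    """Return the world coordinate of cells 1..naxis along one linear axis.

    Follows the FITS linear convention: coord = crval + cdelt * (pixel - crpix),
    where pixel is counted from 1.
    """
    return [crval + cdelt * (pixel - crpix) for pixel in range(1, naxis + 1)]


# Hypothetical axis: 41 cells, cell 1 located at 0.0 Mpc, spacing 2.5 Mpc.
x = axis_coordinates(naxis=41, crpix=1.0, crval=0.0, cdelt=2.5)
print(x[0], x[1], x[-1])
```

Three numbers per axis replace a full coordinate cube, which is the efficiency argument made above.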

Still I also liked your original approach, which, as I commented in my earlier reply, seemed to lead to a kind of XML equivalent of the FITS image specification. I wonder whether the VOTable group has considered putting image data in an XML form of FITS, just as they did for the FITS binary table. I'll pose the question on their mailing list. Though we can use the multi-dimensional array, it seems not as natural.

Then, though it is possible to use this same formalism for particle data as well, I think the tabular approach is more natural there in many circumstances. In particular in the work that I have been doing with databases,
the natural representation of more complex individual objects is as a table, with all the properties, including
now the positions, in a row. The way to store such tabular datasets in binary form is specified exactly in the existing VOTable spec, in section 5.3. An equivalent C-struct oriented format in binary files is what I have encountered consistently for more complex objects coming, for example, from the postprocessing of cosmological simulations at the MPA in Garching.
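
To make the row-oriented layout concrete, here is a minimal sketch that packs and unpacks particle records as fixed-size rows, roughly the Python analogue of one C struct per object. The field list (x, y, z, vx, vy, vz as 4-byte floats), the helper names and the big-endian byte order are assumptions chosen for the illustration; section 5.3 of the spec remains the reference for the actual encoding.

```python
import struct

# One row = six 4-byte floats (x, y, z, vx, vy, vz).
# Big-endian (">") is an assumption made for this sketch.
ROW = struct.Struct(">6f")


def pack_rows(rows):
    """Serialise a list of 6-tuples as consecutive fixed-size records."""
    return b"".join(ROW.pack(*row) for row in rows)


def unpack_rows(buffer):
    """Read consecutive fixed-size records back into a list of 6-tuples."""
    return [ROW.unpack_from(buffer, offset)
            for offset in range(0, len(buffer), ROW.size)]


# Two made-up particles; the values are exactly representable as float32.
particles = [(1.0, 2.0, 3.0, 0.5, -0.5, 0.25),
             (4.0, 5.0, 6.0, 1.5, -1.5, 0.75)]
blob = pack_rows(particles)
print(ROW.size, len(blob))
print(unpack_rows(blob)[1])
```

All properties of one particle sit next to each other in the file, in contrast to the one-array-per-property layout discussed next.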

But you're right that many people also store particle data in individual arrays for each particle property.
That is more naturally mapped in the sense of your rank 1/2 examples. Making the same adjustment as above for
the datacubes I would propose to allow also something as in the following example:

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
  <RESOURCE name="myParticles">
    <TABLE name="Particles" ID="NBody">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x"
             datatype="float" arraysize="100000x1" unit="Mpc" />
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y"
             datatype="float" arraysize="100000x1" unit="Mpc" />
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z"
             datatype="float" arraysize="100000x1" unit="Mpc" />
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x"
             datatype="float" arraysize="100000x1" unit="km/s" />
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y"
             datatype="float" arraysize="100000x1" unit="km/s" />
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z"
             datatype="float" arraysize="100000x1" unit="km/s" />
      <DATA>
        <BINARY> <STREAM href="file:///scratch/myhome/test.bin"/> </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

I would advocate supporting both representations for particle data, tabular and (1D) array.
In the latter case we still need something to distinguish between particle data and image data.
Your rank basically does that, just the name might be unfortunate. We might want to be more explicit
about the kind of data that is stored, an attribute with values MESH, N_BODY maybe ?

In your example you use an HDF5 binary file. VOTable does not support that, though it does support FITS,
I suppose as BINARY table (see VOTable spec section 5.2). Is there a natural mapping from VOTable keywords
to HDF metadata structures? Or shall we first concentrate on the binary serialisations specified in VOTable?

Cheers

Gerard



From: Claudio Gheller [mailto:c.gheller-at-cineca.it]
Sent: Thursday, August 31, 2006 2:55 PM
To: Gerard
Cc: theory-at-ivoa.net; Ugo Becciani; R. Smareglia
Subject: Re: VOTable for simulations

Ciao Gerard,
in the meantime I have thought a little about possible formats for the VOTable. In fact I came to the conclusion that there is little new to add to the already existing VOTable specification, both for grids and for particles.
The only parameter that I think we have to add is the "rank" parameter (it may already exist, but I could have missed it). Rank is the only parameter that makes grids different from particles, and scalars from vectors. For the rest, particles are completely the same as grids. NO different approaches are needed.

In practice:
Rank = 1 --> scalar on particles (a sequence of scalar values associated with the N particles, one value per particle, N values)
Rank = 2 --> vector on particles (sets of three values per particle, Nx3)
Rank = 3 --> scalar on grids (one value per grid point, NxNxN, assuming a cubic grid for simplicity)
Rank = 4 --> vector on grids (set of three values per grid point, NxNxNx3)
At the moment let's consider only 3D simulations. From the example below you can notice that "rank" is redundant information that can also be obtained directly from the "arraysize" parameter. But then you must go through a parsing step, and therefore it could be useful to keep it highlighted in a specific parameter.
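
The parsing step in question is small. The following sketch derives the rank from an arraysize string and looks up its meaning in the scheme listed above. The names `parse_arraysize` and `RANK_MEANING` are invented for this example, and only fixed sizes such as "41x41x41x3" are handled (no variable-length "*" dimension).

```python
def parse_arraysize(arraysize):
    """Split a fixed-size arraysize string such as '41x41x41x3' into ints."""
    return tuple(int(n) for n in arraysize.split("x"))


# Meaning of each rank in the scheme proposed in this mail.
RANK_MEANING = {
    1: "scalar on particles",
    2: "vector on particles",
    3: "scalar on grid",
    4: "vector on grid",
}

for text in ("10000", "10000x3", "41x41x41", "41x41x41x3"):
    dims = parse_arraysize(text)
    rank = len(dims)  # the rank is just the number of dimensions
    print(f"{text}: rank {rank}, {RANK_MEANING[rank]}")
```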

In this version, several variables, of different sizes, can be stored in the SAME file. The file could have different formats (fits, hdf... which must be specified properly). I assume, for the moment, a raw binary file, where variables are written one after the other (the standard table structure in rows and columns is not efficient or even possible). The entry point for each variable can be easily calculated using the "arraysize" and "datatype" parameters. Furthermore, the order in which they are stored must be specified. This could be the order in which the FIELDs are listed in the VOTable.
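
A minimal sketch of that entry-point calculation follows, assuming the variables are written back to back in FIELD order with no padding or per-variable header. The function name `entry_points` and the byte-width table are assumptions for illustration; the four fields are the ones used in the example of this mail.

```python
from math import prod

# Byte widths of the datatypes used here (assumed: float=4, double=8,
# int=4, long=8).
BYTES = {"float": 4, "double": 8, "int": 4, "long": 8}


def entry_points(fields):
    """Return ({name: byte offset}, total size) for back-to-back variables."""
    offsets, position = {}, 0
    for name, datatype, arraysize in fields:
        offsets[name] = position
        count = prod(int(n) for n in arraysize.split("x"))
        position += count * BYTES[datatype]
    return offsets, position


# (name, datatype, arraysize), in the order the FIELDs appear.
FIELDS = [
    ("BmTemperature", "float", "41x41x41"),
    ("BmVelocity", "float", "41x41x41x3"),
    ("ParticlPos", "float", "10000x3"),
    ("PartDens", "float", "10000"),
]

offsets, total = entry_points(FIELDS)
for name, offset in offsets.items():
    print(f"{name}: starts at byte {offset}")
print("total bytes:", total)
```

A reader can then seek directly to one variable without touching the others, provided the FIELD order really matches the order in the file.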

Example: our data file contains a scalar field on a mesh, a vector field on a mesh, and a scalar and a vector field on particles:

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">

  <RESOURCE name="myTestResource">
    <TABLE name="BmTemperature" ID="MyTestTable">
      <FIELD name="BmTemperature" ID="myTestObject1" ucd=""
             datatype="float" rank="3" arraysize="41x41x41"   unit="K" />
      <FIELD name="BmVelocity"    ID="myTestObject2" ucd=""
             datatype="float" rank="4" arraysize="41x41x41x3" unit="km/sec" />
      <FIELD name="ParticlPos"    ID="myTestObject3" ucd=""
             datatype="float" rank="2" arraysize="10000x3"    unit="Mpc" />
      <FIELD name="PartDens"      ID="myTestObject4" ucd=""
             datatype="float" rank="1" arraysize="10000"      unit="g/cm3" />
      <DATA><BINARY>
        <STREAM href="file:///scratch/myhome/test.h5"/>
      </BINARY></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

Let me know your opinion.
Claudio

Gerard wrote:
Hi Claudio
Sorry for the late reply to this email. I'm Cc-ing the theory group as well

I gather you are thinking of grid simulation data here, so this mail does not apply to N-body. Anyway, for that I think we can use the VOTable spec as it stands, in particular section 5.3 dealing with binary serialisation (see http://www.ivoa.net/Documents/REC/VOTable/VOTable-20040811.pdf ).

In the case you address, would it make sense to try to mimic FITS in the naming of keywords, so use NAXIS for rank, and NAXIS1 for size0, NAXIS2 for size1 etc. for the dimensions? If I am not mistaken VOTable itself is based on the FITS binary table spec, so your proposal might be seen as a translation of a FITS datacube (IMAGE). Did we actually not think about using FITS as is for (uniform) grid simulations? In that case your proposal could also be used I guess, where instead of STREAM we'd have FITS as in standard VOTable usage (though I don't know whether VOTable presumes that the FITS file contains a table).
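
A small sketch of the keyword mapping suggested here: the rank becomes NAXIS and size0, size1, ... become NAXIS1, NAXIS2, ... The helper name `fits_style_keywords` is made up for the example, and the result is a plain dictionary, not a real FITS header.

```python
def fits_style_keywords(arraysize):
    """Map an arraysize string to FITS-like NAXIS/NAXISn keyword values."""
    dims = [int(n) for n in arraysize.split("x")]
    header = {"NAXIS": len(dims)}             # NAXIS plays the role of rank
    for axis, size in enumerate(dims, start=1):
        header[f"NAXIS{axis}"] = size         # NAXIS1 <-> size0, and so on
    return header


print(fits_style_keywords("41x41x41"))
```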

I am not sure whether FITS images/datacubes allow multiple values per cell (i.e., have an array size), but I don't think so. Otherwise we could probably generalise in that direction.
Do you propose to follow the VOTable/FITS directions on little-vs big-endian ?
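
Why the byte order has to be fixed can be shown in a few lines. FITS stores binary values most significant byte first (big-endian), whereas a raw dump written on a typical PC is little-endian. The sketch below is only an illustration of the difference, not a statement about what the simulation files in this thread contain.

```python
import struct

value = 1.0
big = struct.pack(">f", value)      # most significant byte first
little = struct.pack("<f", value)   # least significant byte first
print(big.hex(), little.hex())

# Reading big-endian bytes as little-endian raises no error;
# it silently produces a different number.
misread = struct.unpack("<f", big)[0]
print(misread == value)
```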

Cheers

Gerard   

-----Original Message-----
From: Claudio Gheller [mailto:c.gheller-at-cineca.it]
Sent: Thursday, July 20, 2006 12:37 PM
To: Gerard Lemson; Ugo Becciani; Alessandro Costa; Marco Comparato; R. Smareglia
Subject: VOTable for simulations

Dear friends,

I have tried to figure out the structure of a VOTable for simulated data. The result follows.
I made the following assumptions:

  1. data are binary
  2. the binary file is a raw stream of bytes, with no structure (no fits, no hdf...). It is external to the VOTable (at the moment I've not considered base64 conversion for performance reasons)
  3. Each file has an XML descriptor associated. The descriptor at present gives only the information necessary to deal with the file.
  4. Each file contains ONE variable. This is suggested for the following reasons:
     - data rank and size can change from variable to variable
     - complex description
     - the direct association XML header file - bin file - variable is easier to handle
     - smaller files
     - files easier to handle by external applications (also not VO-compliant)
     - drawback: proliferation in the number of files
     However we can consider support for more complex files or even formats, like FITS or HDF5. But let's start with something simple.

At this point I made the Snap program create binary files (at present still HDF5, but just for backward compatibility) and associated XMLs. For example:
test.h5 ----> snapped data
test.h5.xml ----> associated VOTable:

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">

  <RESOURCE name="myTestResource">
    <TABLE name="BmTemperature" ID="MyTestTable">
      <FIELD name="BmTemperature" ID="myTestObject"
             ucd="" datatype="float" arraysize="41x41x41" unit="Kelvin" />
      <PARAM name="rank" datatype="int" value="3"/>
      <PARAM name="size0" datatype="long" value="41"/>
      <PARAM name="size1" datatype="long" value="41"/>
      <PARAM name="size2" datatype="long" value="41"/>
      <DATA><BINARY>
        <STREAM href="file:///scratch/myhome/test.h5"/>
      </BINARY></DATA>
    </TABLE>
  </RESOURCE>

</VOTABLE>

Notice that the rank and size of the dataset is expressed in the arraysize keyword of FIELD. It is also written in the 4 PARAM fields. This is just to avoid the parsing of the string to get the basic info of rank and size and to have them directly as numbers (with their precise type). At present there are no UCD and no reference to the SNAP protocol, since both are not yet defined. I'm working on the latter...
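
Since the same information is stored twice, a reader may want to check that the two copies agree. The sketch below does this with Python's standard `xml.etree.ElementTree`; the function name `check_consistency` is hypothetical, the embedded document is a trimmed copy of the descriptor (without the DATA element), and the RESOURCE name is quoted because an XML parser requires quoted attribute values.

```python
import xml.etree.ElementTree as ET

# Default namespace used by the descriptor in this mail.
NS = "{http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd}"

DOC = """<?xml version="1.0"?>
<VOTABLE xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
  <RESOURCE name="myTestResource">
    <TABLE name="BmTemperature" ID="MyTestTable">
      <FIELD name="BmTemperature" ID="myTestObject" ucd=""
             datatype="float" arraysize="41x41x41" unit="Kelvin"/>
      <PARAM name="rank" datatype="int" value="3"/>
      <PARAM name="size0" datatype="long" value="41"/>
      <PARAM name="size1" datatype="long" value="41"/>
      <PARAM name="size2" datatype="long" value="41"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def check_consistency(xml_text):
    """True if the rank/sizeN PARAMs agree with the FIELD's arraysize."""
    table = ET.fromstring(xml_text).find(f"{NS}RESOURCE/{NS}TABLE")
    field = table.find(f"{NS}FIELD")
    dims = [int(n) for n in field.get("arraysize").split("x")]
    params = {p.get("name"): int(p.get("value"))
              for p in table.findall(f"{NS}PARAM")}
    sizes = [params[f"size{i}"] for i in range(params["rank"])]
    return params["rank"] == len(dims) and sizes == dims


print(check_consistency(DOC))
```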

This is the very first attempt!!! Let me know all your comments.
Claudio

--
------------------------------------
Dr. Claudio Gheller, Ph.D.
High Performance System Division
CINECA - Bologna - Italy
Tel. +39-051-6171560
Fax. +39-051-6137273
------------------------------------
    




  



Received on 2006-08-31Z18:59:51