Logical storage units in VOSpace 1.1

Dave Morris dave at ast.cam.ac.uk
Wed Aug 15 09:19:38 PDT 2007


Matthew Graham wrote:

> A request has come from our friends at SDSC to include references to the 
> actual storage units that data is being deposited on. The use case is 
> data replication so, for example, I want to move/copy a data object 
> from a slow tape archive to an ultrafast disk but both hardware units 
> are within the same VOSpace or I want to retrieve a data object from 
> the ultrafast disk copy and not the slow tape one.

Yep, would be nice to have this.
We talked about this with the SDSC developers in February, but I haven't 
figured out an easy way to integrate this into vospace yet.

> I think that we can incorporate this easily into our existing data 
> model. We will refer to hardware units as logical storage units with 
> the implication that they are identified via a logical identifier 
> (URI) that is set by the particular VOSpace implementation.

Ok so far.

> To get the list of available storage units from a VOSpace, we will 
> need a method: getLogicalStorageUnits() which will return a list of URIs.

Problem : this only works if everything is treated as a 'BLOB'.
As soon as we distinguish between types of structured data, e.g. 
'tabular data' or 'image', then the global method no longer works.

If I have three 'logical storage units', two implemented as disk files, 
and one as a relational database, then the system behaves differently 
depending on which 'storage unit' the data is stored in.

You could transfer tabular data from the 'database store' to one of the 
'disk store'(s), but it would be stored as a file on the disk, and you 
will probably lose the ability to treat it as structured data (would 
this change the node type from StructuredData to UnstructuredData ?).

If you have two copies of the data, one in a database store and one in a 
file store, and used an ADQL interface to modify the structured data in 
the database, do all the changes get replicated to the copy stored as a 
file ?

How do we express rules like "you can replicate a FITS tabular file in a 
database store, but you can't replicate a FITS image file in a database 
store" ?

To do this sort of thing, we would need a list of 'allowed stores' for 
each node, and some may be mutually exclusive.
So although you could transfer the tabular data from database store to 
file store, you can't have it in both at the same time (one is 
structured queryable data the other isn't  - if it was in both, would it 
be represented as a Structured or Unstructured data node ?).
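As a rough sketch of what a per-node list of 'allowed stores' with 
mutually exclusive pairs might look like (this is not part of any VOSpace 
specification; the Node class, the store URIs and the can_replicate_to 
method are all hypothetical names made up for illustration):

```python
from dataclasses import dataclass, field

# Hypothetical store identifiers; a real service would define its own URIs.
DB_STORE = "ivo://example/stores/database"
DISK_ONE = "ivo://example/stores/disk-one"
DISK_TWO = "ivo://example/stores/disk-two"


@dataclass
class Node:
    """Hypothetical node carrying its own replication rules."""
    name: str
    # Stores this node is allowed to live in at all.
    allowed_stores: set
    # Pairs of stores that must not hold this node at the same time.
    exclusive: set = field(default_factory=set)
    # Stores that currently hold a copy of the data.
    current_stores: set = field(default_factory=set)

    def can_replicate_to(self, store):
        """Return True if a copy of this node may be created in `store`."""
        if store not in self.allowed_stores:
            return False
        for held in self.current_stores:
            if frozenset((held, store)) in self.exclusive:
                return False
        return True


# Tabular data: may live in the database or on disk, but not both at once.
table = Node(
    name="catalogue.fits",
    allowed_stores={DB_STORE, DISK_ONE, DISK_TWO},
    exclusive={
        frozenset((DB_STORE, DISK_ONE)),
        frozenset((DB_STORE, DISK_TWO)),
    },
    current_stores={DISK_ONE},
)

# Image data: the database store is simply not in the allowed list.
image = Node(
    name="image.fits",
    allowed_stores={DISK_ONE, DISK_TWO},
    current_stores={DISK_ONE},
)

print(table.can_replicate_to(DISK_TWO))  # second disk copy
print(table.can_replicate_to(DB_STORE))  # database copy while on disk
print(image.can_replicate_to(DB_STORE))  # image into the database
```

Even in this toy form the rules have to be attached to each node, which 
is the point: a single global list of stores can't express them.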
 
We have a similar problem with the global listViews and listProtocols 
methods at the moment. Not all nodes may be able to support all the 
views and protocols, but the global methods don't tell you which ones 
are valid for which nodes.

> These URIs may be resolvable to a description of the storage unit.

Yep, ok with that.

> The logical storage unit identifier will be an optional argument in 
> the <transfer> entity so that as part of the data transfer 
> negotiation, the user can specify a list of storage units that they 
> want the data transferred to/from.

This implies replicated storage, which is what SRB is very good at. 
However, this does add a lot of complications.

Do we guarantee that data replication is handled transparently, or do we 
mark some of the stored data as out of date ?
If the data for a node is stored in two 'storage units' [a] and [b], 
user A sends new data to 'storage unit' [a] and user B reads their data 
from 'storage unit' [b], what data does user B get back ?
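To make the 'out of date' option concrete, here is a minimal sketch that 
tracks a version number per replica, so a read can at least report that 
the copy is stale. This is only one possible approach, it is not how SRB 
or IRODS handle it, and the ReplicatedNode class and its methods are 
hypothetical names:

```python
class ReplicatedNode:
    """Hypothetical node whose data is held in several storage units."""

    def __init__(self, stores):
        self.latest = 0
        # store name -> (version, data); every replica starts empty.
        self.replicas = {store: (0, b"") for store in stores}

    def write(self, store, data):
        """Write new data to one store only, leaving the others untouched."""
        self.latest += 1
        self.replicas[store] = (self.latest, data)

    def read(self, store):
        """Return (data, is_stale) for the copy held in `store`."""
        version, data = self.replicas[store]
        return data, version < self.latest

    def synchronise(self):
        """Copy the newest replica to every store."""
        newest = max(self.replicas.values(), key=lambda replica: replica[0])
        for store in list(self.replicas):
            self.replicas[store] = newest


node = ReplicatedNode(["a", "b"])
node.write("a", b"new data")      # user A sends new data to store [a]
data, is_stale = node.read("b")   # user B reads from store [b]
print(data, is_stale)
```

Without the synchronise step (or transparent replication behind the 
scenes), user B gets the old data back and has to be told that it is out 
of date.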

Do the same permissions apply to all the copies of the data ?
If user 'root' can read/write from all the stores, but user 'fred' can't 
write to the tape store, then what happens if 'root' creates a 
replicated copy of a node on the tape store. Can 'fred' still modify the 
data on disk (making the tape version out of date), or does it become 
read only because he can't modify the tape copy ?

> The identifier will also be an optional argument in the <node> entity 
> so that specific hardware can be targeted in moving and copying data.

If the data for one node may be replicated on more than one store, it 
would have to be a list of <store> elements in each <node>.

> Comments, suggestions, etc.

Yes, data replication and logical stores would be nice.
However, to do it right would mean a lot of work, and add a lot of 
complexity.
The SRB and IRODS systems have already solved these problems, so do we 
really want to re-invent this particular wheel ?

What is the science use case for this ?
And can the use case be handled by using the existing SRB or IRODS systems ?

When our astronomers saw a demo of IRODS, their comments were :
    "I'd like the system to have this capability, but I wouldn't need to 
    use it as part of my normal work".
    "It would be very useful if our sys admin had these kind of tools 
    ... so they could manage the data for us"
    ".. but as a scientist I wouldn't want to handle things at this level"

If we add a list of [capability] elements alongside the [accepts] and 
[provides] elements in a [node], then a replicated data store based on 
IRODS could include [uri for IRODS interface + endpoint] as a capability.

    [node]
        [properties]
            ....
        [/properties]
        [accepts]
            ....
        [/accepts]
        [provides]
            ....
        [/provides]
        [capabilities]
            [capability uri="ivo://capability.uri.for.irods"]
                [endpoint].....[/endpoint]
            [/capability]
        [/capabilities]
    [/node]

Effectively, the vospace service would be saying 'replication settings 
for the data in this node can be manipulated with the IRODS API using 
this endpoint'. We get access to all of the very nice tools that SDSC 
are developing, without having to define a whole new API for handling 
replication.
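From the client side, finding the replication endpoint could be as 
simple as the sketch below. It assumes the [node] example above is 
serialised as plain XML with the same element names; that serialisation, 
the find_endpoint helper and the example.org endpoint URL are assumptions 
for illustration, not part of the VOSpace schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of the [node] example; the endpoint URL is
# a placeholder, not a real service.
NODE_XML = """
<node>
  <capabilities>
    <capability uri="ivo://capability.uri.for.irods">
      <endpoint>http://example.org/irods</endpoint>
    </capability>
  </capabilities>
</node>
"""


def find_endpoint(node_xml, capability_uri):
    """Return the endpoint for `capability_uri`, or None if the node
    does not advertise that capability."""
    root = ET.fromstring(node_xml)
    for capability in root.findall("./capabilities/capability"):
        if capability.get("uri") == capability_uri:
            endpoint = capability.find("endpoint")
            if endpoint is not None and endpoint.text:
                return endpoint.text.strip()
    return None


print(find_endpoint(NODE_XML, "ivo://capability.uri.for.irods"))
```

A client that doesn't know about a capability URI can just ignore it, so 
nodes without replication support need no special handling.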

Note : I haven't studied IRODS in detail, but I am impressed with 
what I have seen.

Note : This does not mean that IRODS is the only replication API. If we 
really, really, want to, we could still define an IVOA standard 
replication API, and add that as a capability.
 
        ....
        [capabilities]
            [capability uri="ivo://capability.uri.for.ivoa.replication"]
                [endpoint].....[/endpoint]
            [/capability]
        [/capabilities]
        ....

Data replication would be nice, but do we need to define it in vospace ?
Or can we pass it over to an established API that has been designed to 
handle this sort of thing ?

Dave


