Dave:
Actually we use logical storage names in iRODS/SRB to simplify
interactions with remote storage systems. Users interact with:
The system selects which replica of the file to access, based on whether there is a copy at the same IP address, whether there is a copy on any disk resource in the world, or whether there is a copy on tape.
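As a rough illustration of that selection order, here is a minimal Python sketch. It is not SRB/iRODS code: the Replica record, its field names, and select_replica are hypothetical names made up for this example, and the ranking simply mirrors the three cases above (same IP address, then any disk copy, then tape).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record for one physical copy of a logical file."""
    host: str   # IP address of the storage resource holding this copy
    media: str  # "disk" or "tape"


def select_replica(replicas, client_host):
    """Pick a replica: same host first, then any disk copy, then tape."""
    if not replicas:
        raise LookupError("no replica registered for this logical file name")

    def rank(replica):
        if replica.host == client_host:
            return 0  # a copy at the same IP address as the requester
        if replica.media == "disk":
            return 1  # a copy on any disk resource
        return 2      # fall back to a tape copy

    # min() keeps the first replica among those with the best rank.
    return min(replicas, key=rank)
```

The user would only ever supply the logical file name; a lookup from that name to the list of replicas is assumed to happen before this function is called.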
When a user writes a file, the logical storage name can be used to represent:
From the perspective of the user, they only interact with logical file names and logical storage names. A sophisticated user can query the system, identify the associated physical storage systems, and direct specific replicas to specific storage locations. This is the exception.
The concern about integrating both storage systems and databases into the same logical name space is handled by differentiating between BLOBs and tables. The SRB provides the same set of operations on BLOBs as on files. Thus a Binary Large Object can be read, written, and modified through the same file manipulation commands used to read, write, and modify files.
For interactions with a database table, the user issues a different set of commands, such as an SQL query command with the results packaged as an XML file that is sent to the client.
Operations that depend upon the data type of a file are implemented through separate APIs. An example is manipulation of HDFv5 files. We use the HDFv5 client APIs to control the requested operations, but execute the HDFv5 library calls at the remote storage system to perform the desired manipulations.
If a user extracts data from a database table into an XML file using an SQL query command, an SRB client that understands how to parse XML files is needed to do further manipulation.
The iRODS system removes this restriction by supporting micro-services that process explicit data format types. A rule can be written that checks the data type, and then automatically invokes the correct parsing operations for manipulating the file. It is possible to create a rule that checks the type of storage system, and then based upon the file type issues different micro-services to parse and manipulate a file.
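The idea of a rule that checks the data type and then invokes the matching micro-service can be sketched as a simple dispatch table. This is plain Python, not the iRODS rule language, and every name in it (parse_xml, parse_hdf5, MICRO_SERVICES, apply_rule) is a placeholder invented for the illustration.

```python
def parse_xml(path):
    """Placeholder for a micro-service that understands XML files."""
    return f"parsed {path} as XML"


def parse_hdf5(path):
    """Placeholder for a micro-service that understands HDFv5 files."""
    return f"parsed {path} as HDF5"


# Hypothetical mapping from a file's data type to its micro-service.
MICRO_SERVICES = {
    "xml": parse_xml,
    "hdf5": parse_hdf5,
}


def apply_rule(path, data_type):
    """Check the data type, then invoke the matching micro-service."""
    handler = MICRO_SERVICES.get(data_type)
    if handler is None:
        # No format-specific processing is registered for this type.
        return f"no micro-service for type {data_type!r}; {path} left as-is"
    return handler(path)
```

A rule that also checks the type of storage system could be sketched the same way by keying the table on a (storage type, data type) pair.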
The VOSpace interface can remain independent of the logical name spaces used by SRB/iRODS. However, interactions with data in the data grid would be based on the logical names for both files and storage. The physical file names (including replicas) and actual storage locations would be hidden from the users by default data grid functions. The default functions would implement a standard algorithm for selecting the best file (closest replica) and for writing to the best storage resource.
Reagan
>Matthew Graham wrote:
>
>>A request has come from our friends at SDSC to include references to the
>>actual storage units that data is being deposited on. The use case
>>is data replication so, for example, I want to move/copy a data
>>object from a slow tape archive to an ultrafast disk but both
>>hardware units are within the same VOSpace or I want to retrieve a
>>data object from the ultrafast disk copy and not the slow tape one.
>
>Yep, would be nice to have this.
>We talked about this with the SDSC developers in February, but I
>haven't figured out an easy way to integrate this into vospace yet.
>
>>I think that we can incorporate this easily into our existing data
>>model. We will refer to hardware units as logical storage units
>>with the implication that they are identified via a logical
>>identifier (URI) that is set by the particular VOSpace
>>implementation.
>
>Ok so far.
>
>>To get the list of available storage units from a VOSpace, we will
>>need a method: getLogicalStorageUnits() which will return a list of
>>URIs.
>
>Problem : this works if everything is treated as a 'BLOB'.
>As soon as we distinguish between types of structured data, e.g.
>'tabular data' or 'image', then the global method no longer works.
>
>If I have three 'logical storage units', two implemented as disk
>files, and one as a relational database, then the system behaves
>differently depending on which 'storage unit' the data is stored in.
>
>You could transfer tabular data from the 'database store' to one of
>the 'disk store'(s), but it would be stored as a file on the disk,
>and you will probably lose the ability to treat it as structured
>data (would this change the node type from StructuredData to
>UnstructuredData ?).
>
>If you have two copies of the data, one in a database store and one
>in a file store, and used an ADQL interface to modify the structured
>data in the database, do all the changes get replicated to the copy
>stored as a file ?
>
>How do we express rules like "you can replicate a FITS tabular file
>in a database store, but you can't replicate a FITS image file in a
>database store" ?
>
>To do this sort of thing, we would need a list of 'allowed stores'
>for each node, and some may be mutually exclusive.
>So although you could transfer the tabular data from database store
>to file store, you can't have it in both at the same time (one is
>structured queryable data the other isn't - if it was in both,
>would it be represented as a Structured or Unstructured data node ?).
>
>We have a similar problem with the global listViews and
>listProtocols methods at the moment, not all nodes may be able to
>support all the views and protocols, but the global methods don't
>tell you which ones are valid for which nodes.
>
>>These URIs may be resolvable to a description of the storage unit.
>
>Yep, ok with that.
>
>>The logical storage unit identifier will be an optional argument in
>>the <transfer> entity so that as part of the data transfer
>>negotiation, the user can specify a list of storage units that they
>>want the data transferred to/from.
>
>This implies replicated storage, which is what SRB is very good at.
>However, this does add a lot of complications.
>
>Do we guarantee that data replication is handled transparently, or
>do we mark some of the stored data as out of date ?
>If the data for a node is stored in two 'storage units'[a] and [b],
>user 'A' sends new data to 'storage unit'[a] and user B reads their
>data from 'storage unit'[b], what data does user B get back ?
>
>Do the same permissions apply to all the copies of the data ?
>If user 'root' can read/write from all the stores, but user 'fred'
>can't write to the tape store, then what happens if 'root' creates a
>replicated copy of a node on the tape store. Can 'fred' still modify
>the data on disk (making the tape version out of date), or does it
>become read only because he can't modify the tape copy ?
>
>>The identifier will also be an optional argument in the <node>
>>entity so that specific hardware can be targeted in moving and
>>copying data.
>
>If the data for one node may be replicated on more than one store,
>it would have to be a list of <store> elements in each <node>,
>>Comments, suggestions, etc.
>
>Yes, data replication and logical stores would be nice.
>However, to do it right would mean a lot of work, and add a lot of complexity.
>The SRB and IRODS systems have already solved these problems, so do
>we really want to re-invent this particular wheel ?
>
>What is the science use case for this ?
>And can the use case be handled by using the existing SRB or IRODS systems ?
>
>When our astronomers saw a demo of IRODS, their comments were
> "I'd like the system to have this capability, but I wouldn't need
>to use it as part of my normal work".
> "It would be very useful if our sys admin had these kind of tools
>... so they could manage the data for us"
> ".. but as a scientist I wouldn't want to handle things at this level"
>
>If we add a list of [capability] elements alongside the [accepts]
>and [provides] elements in a [node], then a replicated data store
>based on IRODS could include [uri for IRODS interface + endpoint] as
>a capability.
>
> [node]
> [properties]
> ....
> [/properties]
> [accepts]
> ....
> [/accepts]
> [provides]
> ....
> [/provides]
> [capabilities]
> [capability uri="ivo://capability.uri.for.irods"]
> [endpoint].....[/endpoint]
> [/capability]
> [/capabilities]
> [/node]
>
>Effectively, the vospace service would be saying 'replication
>settings for the data in this node can be manipulated with the IRODS
>API using this endpoint'. We get access to all of the very nice
>tools that SDSC are developing, without having to define a whole new
>API for handling replication.
>
>Note : I haven't studied IRODS in detail, but I am impressed
>with what I have seen.
>
>Note : This does not mean that IRODS is the only replication API. If
>we really, really, want to, we could still define an IVOA standard
>replication API, and add that as a capability.
>
> ....
> [capabilities]
> [capability uri="ivo://capability.uri.for.ivoa.replication"]
> [endpoint].....[/endpoint]
> [/capability]
> [/capabilities]
> ....
>
>Data replication would be nice, but do we need to define it in vospace ?
>Or can we pass it over to an established API that has been designed
>to handle this sort of thing ?
>
>Dave
Received on 2007-08-15Z21:08:53