IVOA

Architecture of the IVOA
Version 0.2

IVOA WG Internal Draft 2004-04-06

Working Group:
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaArchitecture
Author(s):
Roy Williams
Tony Linde
Jonathan McDowell
Tom McGlynn
Reagan Moore
Francois Ochsenbein
Masatoshi Ohishi
Guy Rixon
Doug Tody

Abstract

This document provides a high-level overview of the architecture of the International Virtual Observatory that has emerged over the last few years.

Status of this document

This is a Working Group Internal Draft.



Acknowledgments

1. Introduction

The architecture of the IVO is service-oriented, meaning that components of the system are defined by the nature of requests and responses to services. Because of this, the description of a service is based on the choice of protocols for requests and responses, rather than on classes and methods. Each service is autonomous, and its boundaries are well defined. Services are inherently distributed, so they can be deployed on whichever machine is most suitable.

Data is communicated between services in two basic formats: FITS, which has been an astronomical standard for many years, and XML, a standard syntax for encoding information. In the latter case, the IVO process allows a newly proposed schema to become a standard through a well-defined community process; successful examples so far are VOTable, for representing tabular data with rich metadata, and VOResource, for describing entities in the IVO registry (see below). Future standards will include VOFrame, for space-time coordinate systems, and VORegion, for subsets of the celestial sphere.

Thus IVO services are built to exchange messages that are IVO-standard XML documents. An IVO-standard service type is defined by the nature of these messages. The community of data providers is encouraged to implement such services, and the community of data consumers is encouraged to build portals that use the service types.

The following diagram shows the essential components, which will be discussed in the rest of this paper.

2. Architecture Overview

The objective of the Virtual Observatory is shown at the top of the figure: to improve and unify access to astronomical data and services, primarily for professional astronomers but also for the general public. The top bar of the figure represents this objective: discovery of data and services, reframing and analysing that data through computation, publishing and dissemination of results, and increasing scientific output through collaboration and federation. The IVOA does not specify or recommend any specific portal or library by which users can access IVO data, but some examples of these portals and tools are shown in the grey box.

Different coloured vertical arrows represent the different service types and XML formats by which these portals interface to the IVO. In the IVO architecture, we have divided the available services into three broad classes:

These services are implemented at various levels of sophistication, from a stateless, text-based request-response, up to an authenticated, self-describing service that uses high-performance computing to build a structured response from a structured request. In the IVO, it is intended that services can be used not just individually, but also concatenated in a distributed workflow, where the output of one is the input of another.

The registry services are meant to facilitate publication and discovery of services. Each registry has three kinds of interface: publish, query, and harvest. People can publish to a registry by filling in forms that define services, data collections, projects, organizations, and other entities. The registry may also accept queries in one or more languages (for example the IVO Query Language), so that clients can discover entities that satisfy the specified criteria. The third interface, harvesting, allows registries to exchange information among themselves, so that a query executed at one registry may discover a resource that was published at another.

Registry services expect to label each VO resource with a universal identifier, which can be recognized by the initial string ivo://. Resources can contain links to related resources, as well as external links to the literature, especially to the Astrophysics Data System. The IVO registry architecture is compliant with digital library standards for metadata harvesting and metadata schema, with the intention that IVO resources can appear as part of any university library.
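As a minimal sketch of how a client might recognize such identifiers, the following Python function checks for the ivo:// prefix and splits the remainder into two parts. The split into an "authority" and a "key", the function name, and the example identifier are assumptions made for illustration; the text above specifies only the initial string.

```python
def parse_ivo_identifier(identifier):
    """Split an ivo:// identifier into (authority, key).

    Assumption: the part after 'ivo://' has the form 'authority/key'.
    Only the 'ivo://' prefix itself is stated in this document.
    """
    prefix = "ivo://"
    if not identifier.startswith(prefix):
        raise ValueError("not an IVO identifier: %r" % identifier)
    rest = identifier[len(prefix):]
    # partition() splits at the first '/', leaving the key empty if absent
    authority, _, key = rest.partition("/")
    if not authority:
        raise ValueError("missing authority in %r" % identifier)
    return authority, key


# Hypothetical identifier, used only to exercise the function
authority, key = parse_ivo_identifier("ivo://example.org/catalogs/demo")
```

A registry could use a check of this kind to reject malformed identifiers at publication time, before any metadata is stored.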

Data services range from simple to sophisticated, and return tabular, image, or other data. At the simplest level (conesearch), the request is a cone on the sky (direction/angular radius), and the response is a list of "objects" each of which has a position that is within the cone. Similar services (SIAP, SSAP) can return images and spectra associated with sky regions, and these services may also be able to query on other parameters of the objects.
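The cone-search idea can be sketched in a few lines of Python. This is an illustration of the selection logic only, not the service protocol: the in-memory catalog, the dictionary keys "ra" and "dec", and the function names are assumptions, and all angles are taken to be in degrees.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # min() guards against rounding pushing the argument just above 1
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius):
    """Return the objects whose position lies within the cone."""
    return [
        obj for obj in catalog
        if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius
    ]


# Hypothetical catalog: positions in degrees
catalog = [
    {"name": "A", "ra": 10.00, "dec": 20.0},
    {"name": "B", "ra": 10.05, "dec": 20.0},
    {"name": "C", "ra": 50.00, "dec": -30.0},
]
nearby = cone_search(catalog, ra=10.0, dec=20.0, radius=0.1)
```

A real service would evaluate the same predicate against an indexed database rather than scanning a list, but the request (direction and angular radius) and the response (a list of objects) have the same shape.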

The OpenSkyQuery protocol drives a data service that allows querying of a relational database or a federation of databases. In this case, the request is written in a specific XML abstraction of SQL called ADQL (Astronomical Data Query Language).

The IVO will also support queries written at a more semantic level, including queries to the registry and through data services. To achieve this, the IVO is developing a structured vocabulary called UCD (Unified Content Descriptor) to define the semantic type of a quantity.

The IVO expects to develop standards for more sophisticated services, for example for federating and mining catalogs, image processing and source detection, spectral analysis, and visualization of complex datasets. These services will be implemented in terms of Grid middleware functionality, especially authentication and authorization for large-scale analysis requiring access to allocated resources.

The IVO is collaborating with a number of IT groups that are developing workflow software, meaning a linked set of distributed services with a dataflow paradigm. The objective is to reuse component services to build complex applications, where the services are insulated from each other through well-defined protocols and are therefore easier to maintain and debug. The IVO also expects to use such workflows in the context of Virtual Data, meaning a data product that is generated dynamically only when it is needed, while a cache of precomputed data is used where one is available.
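The Virtual Data idea, generating a product on demand while reusing precomputed results, can be sketched as a small cache wrapper. The class name, the dictionary-based cache, and the stand-in "mosaic" computation are all assumptions for illustration; a real deployment would cache in persistent storage rather than in memory.

```python
class VirtualData:
    """Serve data products on demand, computing each one at most once."""

    def __init__(self, generator):
        self._generator = generator   # function that builds a product from a key
        self._cache = {}              # precomputed or previously built products
        self.computations = 0         # how many times the generator actually ran

    def get(self, key):
        if key not in self._cache:
            # Product does not exist yet: generate it now and remember it
            self._cache[key] = self._generator(key)
            self.computations += 1
        return self._cache[key]


def make_mosaic(region):
    # Stand-in for an expensive computation (hypothetical)
    return "mosaic of %s" % region


products = VirtualData(make_mosaic)
first = products.get("region-1")    # computed on first request
second = products.get("region-1")   # served from the cache
```

From the caller's point of view the product always exists; whether it was precomputed or built on request is hidden behind the same interface.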

In the diagram above, the lowest layer is the actual hardware, and above that are the existing data centers, which implement and/or deploy IVO standard services. Grid middleware is used for high-performance computing, data transfer, authentication, and service environments. Other software components include relational databases, services to replicate frequently used collections, and data grids to manage distributed collections.

A vital part of the IVO architecture is the concept of MySpace, so that users can interact with persistent services. Services that are persistent for a short time can simply maintain state in the server memory, but MySpace extends this by allowing users to store temporary database tables, files and other state in non-volatile storage.

3. Architecture Components

Web and Grid Services

Working Group Lead: Guy Rixon

The IVO architecture uses services at three levels of sophistication, as illustrated in the Services bar in the figure: GET/POST services, SOAP services, and Grid services.

A standard is being developed for every IVO service, including basic heartbeat functions, discovery, access, formatting, retrieval, and manipulation. Given that three mechanisms are being used to implement the services, the corresponding challenges are how to provide equivalent interfaces across the implementations, how to manage the scale of the data manipulation requests, how to handle public (no charge) access versus use of allocated resources, and how to handle anonymous access versus authenticated access. The choice may be to restrict some implementation requirements (authentication, authorization) to specified services (Grid services).

Data Models

Working Group Lead: Jonathan McDowell

Data Models represent a view of an entity as an object in the sense of object-oriented programming; the object can be a subclass of another through inheritance; or the object can be coerced into looking a certain way by implementing a given interface. Data models can be expressed in several semantically-equivalent ways, such as C++ header files, Java interfaces and classes, Unified Modelling Language (UML), or as XML schemata.

Data Models provide the semantic protocol for exchange of queries and metadata between the services and the clients. Registries, data services and compute services will describe data and resources with standard sets of descriptions, so that the same kind of data is described the same way. The data model schemata will provide definitions of the metadata needed for particular kinds of data as well as place that metadata within a standard structure. This level of semantics, describing the structure of astronomical datasets, interacts with the astronomical semantics provided by the UCD schema to quantify use of astronomical knowledge. For example, a data model to define spectra may specify that a spectrum has an array of data representing an observable quantity and an array of values representing the spectral coordinate. The UCDs associated with an instance of this data model will specify whether that particular spectrum has an observable of flux or surface brightness, and a spectral coordinate of frequency or wavelength. A data model may also represent a higher level resource such as a compute service, in which the input parameters required by a particular class of service such as source detection programs are defined. Again, the values of some data model metadata may be UCDs which describe what kind of parameters are to be returned by the source detection.
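The spectrum example can be made concrete with a short sketch. The class below is one possible rendering of such a data model, not the IVOA data model itself: the field names, the equal-length check, and the UCD strings used in the example instance are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Spectrum:
    """Structure of a spectrum: two arrays plus UCDs giving their meaning."""
    spectral_coord: List[float]   # e.g. wavelength or frequency values
    observable: List[float]       # e.g. flux or surface brightness values
    spectral_coord_ucd: str       # semantic type of the spectral coordinate
    observable_ucd: str           # semantic type of the observable

    def __post_init__(self):
        # Assumption: one observable value per spectral coordinate value
        if len(self.spectral_coord) != len(self.observable):
            raise ValueError("coordinate and observable arrays differ in length")


# The data model fixes the structure; the UCDs (illustrative strings here)
# say what this particular instance contains.
spectrum = Spectrum(
    spectral_coord=[400.0, 500.0, 600.0],
    observable=[1.2, 1.5, 1.1],
    spectral_coord_ucd="em.wl",
    observable_ucd="phot.flux",
)
```

The separation mirrors the text: the class describes the structure shared by all spectra, while the two UCD fields carry the astronomical semantics of one instance.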

Registry

Working Group Lead: Tony Linde

An IVOA-compliant registry provides XML-formatted metadata in response to queries. The metadata may be in different formats for different audiences: for example a librarian may be interested in receiving metadata conforming to standards established by the library community, such as Dublin Core or METS (Metadata Encoding and Transmission Standard). The METS standard provides a framework for defining extension schema to characterize IVO specific administrative, descriptive, structural, and behavioral metadata.

The IVOA has published a standard (VOResource) for metadata that is semantically meaningful to astronomers for use in Data Collections, Projects/Organizations, and Services.

The structure is based on Dublin Core, a metadata standard established by the library community for describing, in rough measure, almost any human creation. Metadata elements include Title, Authors, Description, Format, Date, etc. For VOResource, the structure has been extended with VORegion, VOFrame, and VOService attributes that allow the description of regions of the sky, coordinate frames, and services.

Services can be of any of the three types listed above (GET/POST, SOAP, Grid). The second and third types of service carry the ability to describe themselves with a WSDL file or equivalent. In the IVO architecture, the WSDL file is considered as just a set of metadata, like the METS schema for the librarians. In the business world, tools and workflows are being set up on a foundation of SOAP services and their WSDL descriptors. The IVO registry is thus available to these more generic tools. An interesting challenge is the merger of the library METS standard with WSDL service description and Archival Information Packages from the preservation community. Through METS extension schema, it should be possible to use the same metadata framework to describe services, catalogs, and preservation environments.

Registries of the IVO are able to exchange information through a mechanism called harvesting, so that resource metadata known to one may be replicated in others. In this way, a collection of independent registries can become a single virtual registry. The harvesting protocol is OAI-PMH (Open Archives Initiative -- Protocol for Metadata Harvesting). Registries can query each other for the most recently updated records, then copy them.
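The harvesting step, asking another registry for recently updated records and copying them, can be sketched with a toy in-memory registry. This does not implement OAI-PMH; the class, its method names, and the use of integer timestamps are assumptions chosen to show the replication logic only.

```python
class Registry:
    """Toy registry holding records as identifier -> (updated, metadata)."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def publish(self, identifier, metadata, updated):
        self.records[identifier] = (updated, metadata)

    def updated_since(self, timestamp):
        """Records changed strictly after the given timestamp."""
        return {
            ident: rec for ident, rec in self.records.items()
            if rec[0] > timestamp
        }

    def harvest(self, other, since):
        """Copy newer records from another registry; return how many."""
        copied = 0
        for ident, (updated, metadata) in other.updated_since(since).items():
            local = self.records.get(ident)
            # Keep the local copy unless the remote one is newer
            if local is None or local[0] < updated:
                self.records[ident] = (updated, metadata)
                copied += 1
        return copied


# Hypothetical identifiers and metadata
a = Registry("A")
b = Registry("B")
a.publish("ivo://example.org/cat1", {"title": "Catalog 1"}, updated=5)
a.publish("ivo://example.org/cat2", {"title": "Catalog 2"}, updated=9)
copied = b.harvest(a, since=6)   # only cat2 changed after time 6
```

After repeated harvests in both directions, a query at either registry sees the same resources, which is what lets independent registries behave as a single virtual registry.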

Virtual Observatory Query Languages

Working Group Lead: Masatoshi Ohishi

The Virtual Observatory Query Language is a protocol for forming queries to IVO services, and is designed in three levels.

The first level of VOQL is called ADQL, and allows queries that are roughly equivalent to SQL (Structured Query Language). ADQL, however, is an XML expression of the query that can be converted to any one of the vendor-specific dialects of SQL. Furthermore, ADQL has constructs for proximity queries and for sky regions. ADQL is designed to be the request format of the OpenSkyQuery protocol, as defined in the Data Access Layer below.
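To illustrate why an XML expression of a query is convenient for targeting different SQL dialects, the sketch below translates a small XML query into two SQL variants that differ in how they limit the row count. The XML element and attribute names used here (Select, Column, From, Where, maxRows) are invented for this example and are not the ADQL schema; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

OPERATORS = {"lt": "<", "gt": ">", "eq": "="}

# Hypothetical XML query; NOT the real ADQL schema
QUERY = """<Select maxRows="10">
  <Column>ra</Column>
  <Column>dec</Column>
  <From>stars</From>
  <Where column="mag" op="lt" value="15"/>
</Select>"""


def _identifier(name):
    """Accept only simple names, so the generated SQL stays well formed."""
    name = name.strip()
    if not name.isidentifier():
        raise ValueError("unsafe identifier: %r" % name)
    return name


def xml_query_to_sql(xml_text, dialect="limit"):
    """Render the XML query as SQL for a 'limit' or 'top' style dialect."""
    root = ET.fromstring(xml_text)
    columns = ", ".join(_identifier(c.text) for c in root.findall("Column"))
    table = _identifier(root.findtext("From"))
    where_sql = ""
    where = root.find("Where")
    if where is not None:
        value = float(where.get("value"))   # numeric comparisons only
        where_sql = " WHERE %s %s %g" % (
            _identifier(where.get("column")),
            OPERATORS[where.get("op")],
            value,
        )
    max_rows = root.get("maxRows")
    if max_rows is None:
        return "SELECT %s FROM %s%s" % (columns, table, where_sql)
    n = int(max_rows)
    if dialect == "top":
        return "SELECT TOP %d %s FROM %s%s" % (n, columns, table, where_sql)
    return "SELECT %s FROM %s%s LIMIT %d" % (columns, table, where_sql, n)


sql_limit = xml_query_to_sql(QUERY, dialect="limit")
sql_top = xml_query_to_sql(QUERY, dialect="top")
```

Because the query arrives as a parsed structure rather than as SQL text, the same request can be rendered for each back-end database without the client knowing which vendor is in use.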

In the second level of VOQL, databases can be federated (joined). A single query is split into sub-queries, each of which goes to a separate service, and the results are collected and joined.
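The split-and-join pattern can be sketched as follows. Each "service" is modelled as a Python function returning rows, and the results are combined with an inner join on a shared key; the service functions, the key name "id", and the sample rows are assumptions for illustration. In practice astronomical federation usually matches on sky position rather than on a shared identifier.

```python
def federated_query(services, key):
    """Send one sub-query to each service and inner-join the results on key."""
    partial_results = [service() for service in services]
    joined = {row[key]: dict(row) for row in partial_results[0]}
    for rows in partial_results[1:]:
        by_key = {row[key]: row for row in rows}
        # Keep only keys present in every service, merging their columns
        joined = {
            k: {**left, **by_key[k]}
            for k, left in joined.items() if k in by_key
        }
    return sorted(joined.values(), key=lambda row: row[key])


# Hypothetical services, each answering its own sub-query
def optical_service():
    return [{"id": 1, "mag": 12.1}, {"id": 2, "mag": 14.0}]


def xray_service():
    return [{"id": 2, "flux": 3.5}, {"id": 3, "flux": 1.0}]


result = federated_query([optical_service, xray_service], key="id")
```

Only object 2 appears in both catalogs, so it is the only row in the joined result, carrying columns from both services.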

The highest level of VOQL is a semantics-based language that allows astronomers to build queries in the language of astronomy rather than the language of databases.

Data Access Layer

Working Group Lead: Doug Tody

A cornerstone of the IVO is a collection of standard services with well-defined request and response formats, thereby providing a "standard screw-thread" for astronomical data: any IVO-compliant data consumer can work with any IVO-compliant data provider.

The scope of the IVO Data Access Layer also includes the software used to implement data access services. Ultimately this will include advanced capabilities such as data subsetting, data model mediation, and server-side analysis, i.e., for grid computing. We also include client-side software to demonstrate an end-user analysis capability and to perform end-to-end testing and integration. Ultimately most analysis software will come from the user community, not from IVO. DAL builds upon and integrates IVO technology for metadata, data models, data formats, registries, and queries.

The science data dealt with by the DAL potentially includes all of the following:

The highest priority goes to object catalogs and 2D sky images, for which prototype data access services are already available. Spectral data cubes are probably best treated as a general type of image. Next in priority are spectra, especially simple 1D spectra; spectra are a high priority for the next phase of DAL development. Time series data is less common but is similar to 1D spectra and could possibly benefit from a similar approach. It would be useful to integrate event list data and visibility data into the VO, although our expectation is that most VO users will be interested in images produced from such data rather than the original data (image generation may need to be on-the-fly since there is in general no one best way to produce images from event data or UV data).

The highest priority for IVOA data access standards is probably in the area of standard data access protocols; this is what we should emphasize in the first year. Protocols are, or should be, implementation-independent and hence are among the easiest software elements to standardize. As the IVO software and infrastructure become more complicated, it will become increasingly important to provide some reusable VO framework-level software to simplify the job of those putting up services or writing client-side applications. Finally, as we move to grid computing, it will become necessary to dynamically deploy computational software on grid-enabled computational resources. For this to be feasible we will need some interoperability standards in the areas of computational frameworks and components.

Unified Content Descriptors

Working Group Lead: Roy Williams

The Unified Content Descriptor (UCD) is a formal vocabulary for astronomical data that is controlled by the IVOA. The vocabulary is restricted in order to avoid proliferation of terms and synonyms, and controlled in order to reduce ambiguity as far as possible. It is intended to be flexible, and understandable to both humans and computers. UCDs describe astronomical data quantities, and they are built by combining words from the formal vocabulary.

A UCD description of a quantity does not define the units or name of the quantity, but rather 'what sort of quantity is this?'; for example phys.temperature is a semantic class description of temperature, without implying a particular unit.
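A minimal sketch of machine handling of UCDs is shown below. It assumes that a UCD is a sequence of words separated by semicolons and that each word is a dot-separated hierarchy, as in phys.temperature; the semicolon separator, the second example string, and the function names are assumptions not stated in the text above.

```python
def parse_ucd(ucd):
    """Split a UCD into words, and each word into its dot-separated atoms."""
    words = [w.strip() for w in ucd.split(";") if w.strip()]
    return [tuple(word.split(".")) for word in words]


def same_kind_of_quantity(ucd_a, ucd_b):
    """True when both UCDs have the same first (primary) word."""
    return parse_ucd(ucd_a)[0] == parse_ucd(ucd_b)[0]


single = parse_ucd("phys.temperature")
combined = parse_ucd("phot.mag;em.opt.V")   # illustrative combined UCD
```

Two columns with the UCD phys.temperature can be recognized as the same kind of quantity even if one is in kelvin and the other in degrees Celsius, because the UCD carries no unit.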

The UCD committee has tried to resist the temptation to allow the UCD syntax to be overly expressive. Every measurement in science has the possibility of essentially infinite description: the people, the instruments, the error analysis, the reasons, the funders, and so on. We have tried to find a way of organizing atomic specifiers (words) so that it is easy to write simple software for machine use, but also possible to write better, more sophisticated software. This organization, in terms of properties and concepts, maps well to knowledge representation methods outside astronomy. We hope to build more sophisticated "intelligent" systems in the future, a project that has come to be called "UCD3". The major goal of UCD is to ensure interoperability between heterogeneous datasets. The use of a controlled vocabulary should allow a homogeneous, unambiguous description of concepts that will be shared between people and computers in the IVO. We hope in the future to put more semantic expressiveness into the UCD framework, while always keeping a pragmatic eye on those who would create and use the software that will parse the UCD vocabulary.

VOTable Format

Working Group Lead: Francois Ochsenbein

The VOTable format is an XML standard for representing a set of tables, aiming at exchanging properly described data between agents acting in the framework of the Virtual Observatory. In this context, a table is an unordered set of rows, each of a uniform format, as specified in the table metadata. Each row in a table is a sequence of table cells, and each of these contains either a primitive data type, or an array of such primitives.
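The row-and-cell structure described above can be sketched with Python's standard XML parser. The sample document is a simplified fragment using the VOTable element names (TABLE, FIELD, TABLEDATA, TR, TD) without XML namespaces; the sample values, the function name, and the small datatype mapping are assumptions, and array-valued cells other than character strings are not handled.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; values are illustrative only
SAMPLE = """<VOTABLE>
  <RESOURCE>
    <TABLE name="demo">
      <FIELD name="ra" datatype="double" unit="deg"/>
      <FIELD name="dec" datatype="double" unit="deg"/>
      <FIELD name="name" datatype="char" arraysize="*"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.68</TD><TD>41.27</TD><TD>M31</TD></TR>
          <TR><TD>83.82</TD><TD>-5.39</TD><TD>M42</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

# Map a few primitive datatypes to Python converters
CONVERTERS = {"double": float, "float": float, "int": int, "char": str}


def read_votable(xml_text):
    """Return (fields, rows): the metadata first, then the typed cells."""
    root = ET.fromstring(xml_text)
    table = root.find(".//TABLE")
    fields = [(f.get("name"), f.get("datatype")) for f in table.findall("FIELD")]
    rows = []
    for tr in table.findall("./DATA/TABLEDATA/TR"):
        cells = [td.text or "" for td in tr.findall("TD")]
        # Every row has the same format, as declared by the FIELD elements
        rows.append(tuple(
            CONVERTERS[dtype](text) for (_, dtype), text in zip(fields, cells)
        ))
    return fields, rows


fields, rows = read_votable(SAMPLE)
```

The FIELD elements play the role of the table metadata: they are read once, and every subsequent row is interpreted against them.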

VOTable is designed as a flexible storage and exchange format for tabular data, with particular emphasis on astronomical tables. VOTable has built-in features for big-data and Grid computing. It allows metadata and data to be stored separately, with the remote data linked. Processes can then use metadata to "get ready" for their input data, or to organize third-party or parallel transfers of the data. Remote data allow the metadata to be sent in email and referenced in documents without pulling the whole dataset with it: just as we are used to the idea of sending a pointer to a document (URL) in place of the document, so we can now send metadata-rich pointers to data tables in place of the tables themselves. The remote data are referenced with the URL syntax protocol://location, meaning that arbitrarily complex protocols are allowed.

When we are working with very large tables in a distributed-computing environment (the Grid), the data are streamed between processors, with flows being filtered, joined, and cached in different geographic locations. It would be very difficult if the number of rows of the table were required in the header: we would need to stream the whole table into a cache, compute the number of rows, then stream it again for the computation. In the Grid-data environment, the component in short supply is not the computers, but rather these very large caches! Furthermore, these remote data streams may be created dynamically by another process or cached in temporary storage; for this reason VOTable can express that remote data may not be available after a certain time (expires). Data on the net may require authentication for access, so VOTable allows expression of password or other identity information (the "rights" attribute).
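The argument against a mandatory row count can be illustrated with a streaming pipeline built from Python generators. Each stage handles one row at a time, so nothing needs to be cached and the number of rows is known only once the stream is exhausted. The row source, the field names, and the brightness filter are assumptions for illustration.

```python
def generate_rows(n):
    """Stand-in for a remote stream; consumers are never told n in advance."""
    for i in range(n):
        yield {"id": i, "mag": 10.0 + (i % 10)}


def bright_only(rows, limit):
    """Filter stage: pass on rows brighter (numerically smaller) than limit."""
    for row in rows:
        if row["mag"] < limit:
            yield row


def summarize(rows):
    """Final stage: the count emerges only after the whole stream has passed."""
    count = 0
    total = 0.0
    for row in rows:
        count += 1
        total += row["mag"]
    mean = total / count if count else None
    return count, mean


# Stages are chained lazily; only one row is held in memory at a time
count, mean = summarize(bright_only(generate_rows(100), limit=12.0))
```

If the table header had to declare the row count up front, the filter stage could not begin emitting rows until it had seen all of its input, which is exactly the double pass through a large cache that the text describes.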

Applications

Working Group Lead: Tom McGlynn

The architecture must be driven by application requirements. A short list includes:

Data Engineering

Working Group Lead: Reagan Moore

Very high performance applications may wish to work more directly with the data, without being slowed by layers of software, so there is a possibility for direct, bulk access to data stores. The NVO data services would provide the names of files that are to be downloaded, but then bulk access mechanisms would be used for actually working with these files.

The data engineering requirement can be expressed more generically as latency versus granularity management. Sending a million images over a network one at a time is prohibitively costly because it takes a very long time. Aggregating images into an appropriately sized container before transmission can decrease the transmission time by orders of magnitude. On current networks, a similar analysis is needed for choosing between serial and parallel data transport. Granularity analysis is also needed to choose how to aggregate files before storage in an archive, to minimize the impact on the archive name space. The tools that manage data granularity are embedded in data grid software and accessed through grid services. The conjunction of all of the granularity analyses occurs in the data flow pipelines that manage the processing of catalog records and archive images.
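The latency-versus-granularity trade-off can be made concrete with a simple cost model: each transfer pays a fixed latency, and the payload moves at the network bandwidth. The model and every number in the example (image size, latency, bandwidth, container size) are assumptions chosen only to show the scale of the effect.

```python
import math


def transfer_time(n_files, file_bytes, latency_s, bandwidth_bytes_per_s,
                  files_per_container=1):
    """Total time in seconds: one latency per transfer plus the payload time."""
    n_transfers = math.ceil(n_files / files_per_container)
    payload_time = n_files * file_bytes / bandwidth_bytes_per_s
    return n_transfers * latency_s + payload_time


# Assumed scenario: one million 10 kB images, 0.1 s latency, 100 MB/s link
one_at_a_time = transfer_time(1_000_000, 10_000, 0.1, 1e8)
in_containers = transfer_time(1_000_000, 10_000, 0.1, 1e8,
                              files_per_container=10_000)
speedup = one_at_a_time / in_containers
```

Under these assumed figures, the one-at-a-time transfer is dominated by the per-request latency (about 100,000 seconds against about 100 seconds of payload time), whereas with containers of 10,000 images the latency term shrinks to about 10 seconds, so the total falls by roughly three orders of magnitude.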

Data engineering is also needed for preservation. Whatever technology is chosen today by the IVO will become obsolete within the next 5 years. This includes the choice of encoding format, the choice of service protocol, the data flow pipelines, and the underlying hardware systems. The IVO needs to be able to guarantee continued access to catalogs, image archives, and processing pipelines across multiple generations of technology. This is typically expressed as forms of infrastructure independence or virtualization mechanisms. Again the mechanisms that enable incorporation of new technology are currently incorporated in data grids.