Definitions, costs and use-cases

From: Norman Gray <norman-at-astro.gla.ac.uk>
Date: Sat, 22 Sep 2007 11:12:03 +0100

Rob and all,

Rob:

> My previous question about use cases sank without a trace. Perhaps
> I can try again. Here you are making a "case for using this tool",
> not asserting a use case. Before learning to use a tool, one needs
> to know, as Edwin Starr says: What is it good for?

Let's knock this on the head. You want definitions, costs and use-cases: we've got 'em.

I think various people _have_ discussed these before, possibly implicitly, but certainly at length, so I expect that in places I am going to repeat what others have said. I hope that having it in one place will help tie this up, so we can move toward more specific questions, which is where I think Andrea has been trying to herd us.

This has turned into rather a long message. The 'costs' section may go into too much detail, but the definitions and use-case sections should be reasonably compact.

In fact, this message has the feel of a first draft of a wiki page. Should we be having this discussion in a different form?

Definitions

We won't get much from 200-year-old dictionary definitions, or technical abstracts from completely different domains (philosophical ethics, as I recall), so here's the standard definition from computer science (all together, class!):

     an ontology is a formal specification of a shared conceptualisation

That is:

     conceptualisation = a set of things/concepts/types, as appropriate
     shared = ...which at least one other person agrees with
     specification = ... and which you've written down
     formal = ...in a machine-readable way

The various terms folksonomy, vocabulary, thesaurus, taxonomy and ontology all have slightly different definitions (they overlap in practice), but all exist on a single spectrum, or ladder, from informal and suggestive at one end, to formal and expressive at the other. The 'ontology' range is further subdivided ('RDFS' is at one end and 'OWL' at the other -- let's not worry just now).

'Semantics' is just the stuff you're doing after you've grokked whatever syntax you're using, and 'semantic search' is 'trying to do better than simple string matching'. Yes, Google does do rather well with 'simple string matching', but that's because (a) they don't have any choice, as there isn't a great deal of semantically rich material on the wild wild web, outside of specialised domains such as ours; and (b) they have money and kit to throw at the problem of guessing meaning from string coincidences.

Costs

Processing costs: Processors become more (computationally) expensive as you go from less to more formal. Handling a folksonomy requires strcmp(3); handling an ontology requires one of several types of reasoner[1].

However, processors become much more efficient as you go towards the more formal end, since with strcmp, tolower and friends you have to work quite hard, and be quite clever, to extract even a simple meaning like 'this resource is more specific than that resource'. That sort of thing is much more immediate further up the ladder.
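To make that contrast concrete, here is a minimal Python sketch. The terms and the BROADER table are invented for illustration and are not drawn from any real vocabulary. The point is only that string comparison can tell you two keywords are the same, but not that one is more specific than the other, whereas an explicit 'broader' link turns that question into a short lookup.

```python
# Hypothetical mini-vocabulary: each term points at its broader term.
BROADER = {
    "black hole": "compact object",
    "neutron star": "compact object",
    "pulsar": "neutron star",
    "compact object": "astronomical object",
}

def same_keyword(a, b):
    """Folksonomy-level processing: case-insensitive string equality."""
    return a.strip().lower() == b.strip().lower()

def is_narrower(term, ancestor):
    """True if `term` is strictly more specific than `ancestor`."""
    seen = set()  # guards against accidental cycles in the table
    current = BROADER.get(term)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = BROADER.get(current)
    return False
```

Here same_keyword("pulsar", "compact object") is false and tells you nothing more, while is_narrower("pulsar", "compact object") follows two links and answers the question directly. A real reasoner does a good deal more than this, of course.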

Acquisition costs: Folksonomies[2] are a big deal currently because they offer a way of talking about the only vaguely semantic information realistically available on the web. Adding richer information is dramatically more expensive (issues of education, hassle, payoff to the tagger), so might be worth it only for small, high-value, data collections (such as the registry?), or collections which already have most of the structure visible (example?).

Note that not every application necessarily benefits from more expressive structure. Myself, I think that SKOS (taxonomy/thesaurus) provides most of what you really need, and can reasonably acquire, to support searching. Ed is a more unequivocal enthusiast for OWL. The CDS ontology can support automatic classification ('if this object has these properties then it must be of this type'), but not every reasoner can cope with it.

The upside, from the point of view of acquisition costs, is that most of the sciences, with their journal keywords, and the systematising mindset of their users, can probably get on to the second rung for free. The much-lamented lack of interest in the IAU keyword list suggests that getting on to the third rung might be a struggle. The existence of the registry indicates that the people running archives can be persuaded to supply reasonably extensive/expensive semantic information; the prospect of this bringing users to them, and the embarrassment of their logo not appearing where it ought, are what will persuade them to do this _and_ get it right. The largeish number of errors in registry entries shows that the benefits -- custom and visibility -- have not yet been perceived to match the costs.

Opportunity and development costs: Developing (which means agreeing on) a new vocabulary or an ontology is very hard work, and very expensive (as we all know...); it should therefore be avoided as much as possible. Repurposing an existing vocabulary is much better: even if it's not perfect, the benefits of it actually existing outweigh the costs of the fit being a little loose.

Reuse is better than redevelopment for other software as well (news just in: sin is bad!), but the costs are especially high for vocabulary development, since it necessarily consumes the effort and good temper of multiple people simultaneously, and it probably involves the time of valuable domain experts (you can't just hire someone).

Reusing an existing vocabulary should be cheap, and might consist of nothing more than some Perl magic to put the right type of pointy brackets round the items in your vocabulary list.
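As a rough idea of how little that might involve, here is a sketch in Python rather than Perl. It wraps each keyword in a skos:Concept with a skos:prefLabel, using the standard RDF and SKOS namespaces. The base URI, the function name and the way concept URIs are derived from the keywords are all assumptions made up for this example, not an established convention.

```python
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

def keywords_to_skos(keywords, base_uri="http://example.org/vocab#"):
    """Wrap each keyword in a skos:Concept carrying a skos:prefLabel.

    `base_uri` is a placeholder; a real vocabulary needs a stable URI.
    """
    lines = [
        '<rdf:RDF xmlns:rdf="%s"' % RDF_NS,
        '         xmlns:skos="%s">' % SKOS_NS,
    ]
    for keyword in keywords:
        # Crude slug for the concept URI: an assumption for illustration.
        slug = "-".join(keyword.lower().split())
        about = quoteattr(base_uri + slug)  # escapes and adds the quotes
        lines.append("  <skos:Concept rdf:about=%s>" % about)
        lines.append("    <skos:prefLabel>%s</skos:prefLabel>" % escape(keyword))
        lines.append("  </skos:Concept>")
    lines.append("</rdf:RDF>")
    return "\n".join(lines)

print(keywords_to_skos(["black holes", "neutron stars"]))
```

This gets a flat keyword list onto the ladder and no further: any broader/narrower structure still has to come from somewhere, but at least the terms now have URIs that other people can point at.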

The tools and APIs for supporting reasoning (ie, working at the top end of the ladder) are rather hard to use, in my experience, for a mixture of reasons: what they're doing can be rather confusing, and they're still aimed at a fairly specialised developer community, so there isn't the sort of tutorial and community support that would let Joe Developer just pick up a tool and start creating. What that means, I think, is that where those tools are useful, they should be well hidden as services or as middleware, and the community should have a fairly explicit plan about how it will maintain them in the medium term.

At the bottom end of the ladder, there are much more approachable tools for handling and storing RDF (though I haven't yet had to actually API-call an RDF parser, and most of my work in this area has been using XSLT).

That presumes you're using RDF. The benefits have been rehearsed elsewhere this month, so I'll skip them here. The main cost of not doing so is that you cut yourself off from the rest of the world.

Use-cases

Mathilda is reading a paper online. She types the (A&A) keywords for the paper into VOExplorer and asks for 'more like this'. VOExplorer calls out to a service which finds the AOIM and Simbad equivalences of the A&A keywords, and uses the former to query a suitable service to find some pretty pictures, and the latter to query Simbad, presenting the two lists to Mathilda.

There aren't many pretty pictures, so Mathilda asks to expand the search, and VOExplorer asks for pretty pictures corresponding to a more general term, found either directly in the AOIM vocabulary, or by finding a more general Simbad term and then finding the AOIM equivalent of that. The Simbad query, on the other hand, has produced far too many hits, so VOExplorer looks down the tree of Simbad terms which are 'narrower', and asks 'you were looking for compact objects: do you mean black holes, quasars, or...?' Once she has established a suitable keyword or keywords, she can make queries using the equivalent terms in whichever vocabularies the registry or VOEvent keywords are drawn from.

She finds some heterodyne observations (apologies if this is astronomical nonsense, but...), but she's an X-ray person, so is a bit vague, and curious, about just what that is -- but oooh, there's a link to DBpedia/wikipedia, so she goes there on the off-chance the article is decent. The mechanism that brought in the link to DBpedia is the same one that is linking a growing collection of non-specialist semantic resources (see the 'linking open data' project: <http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData>).

Most of the components of that are in place already, in the sense that the vocabularies exist and services can be queried using them. VOExplorer already makes a callout to a skeleton service which doesn't do anything useful yet, but will be expanded starting next year (funding's just arrived). The CDS people (Alexandre in particular) have already demoed an application using the Ontology of Astro Object types which does something similar to the business of broadening and narrowing the Simbad queries.
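The translate/broaden/narrow steps in Mathilda's session can be sketched in a few lines of Python. Everything in the tables below is invented for illustration: the term identifiers are not taken from the real A&A, AOIM or Simbad vocabularies, and the function names are mine, not those of VOExplorer or any existing service.

```python
# Hypothetical 'broader' links within a single vocabulary.
BROADER = {
    "simbad:black-hole": "simbad:compact-object",
    "simbad:quasar": "simbad:compact-object",
}

# Hypothetical cross-vocabulary equivalences: (term, target) -> term.
EQUIVALENT = {
    ("aa:black-holes", "aoim"): "aoim:black-hole",
    ("aa:black-holes", "simbad"): "simbad:black-hole",
    ("simbad:compact-object", "aoim"): "aoim:compact-object",
}

def translate(term, vocabulary):
    """Equivalent of `term` in `vocabulary`, or None if none is known."""
    return EQUIVALENT.get((term, vocabulary))

def broaden(term):
    """The more general term, or None at the top of the tree."""
    return BROADER.get(term)

def narrow(term):
    """All terms one step more specific than `term`."""
    return sorted(t for t, b in BROADER.items() if b == term)

def broaden_for(keyword, target, via):
    """Find a broader `target` term for `keyword`, directly if the
    target vocabulary has one, otherwise by a detour through `via`."""
    direct = translate(keyword, target)
    if direct is not None and broaden(direct) is not None:
        return broaden(direct)
    detour = translate(keyword, via)
    wider = broaden(detour) if detour is not None else None
    return translate(wider, target) if wider is not None else None

# Too few pictures: widen the AOIM search by going via Simbad.
print(broaden_for("aa:black-holes", target="aoim", via="simbad"))
# Too many Simbad hits: offer the narrower terms as 'do you mean...?'
print(narrow("simbad:compact-object"))
```

In practice the two tables would be SKOS broader/narrower and mapping relations held by a service, rather than dictionaries in the client, but the logic that the client has to drive is about this simple.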

All the best,

Norman

[1] A 'reasoner' is something which can, for example, deduce that an instance of a given subtype is also an instance of the type.

[2] A 'folksonomy' is a del.icio.us or Flickr-style cloud of keywords, or the keywords on eBay or a conference abstract, where people ask themselves 'what keyword would other people use to search for this?'. 'Folksonomy' is the same as 'free keyword list with counts of occurrences', but is fewer characters to type.

-- 
------------------------------------------------------------
Norman Gray  :  http://nxg.me.uk
eurovotech.org  :  University of Leicester, UK
Received on 2007-09-22Z19:19:03