Software developers and ontologists generally live in two different worlds. The former group think they are building systems to perform information processing and computation, and the latter group think they are formally describing some aspect of the world.
[Note: slight change to wording of FOPP on 30/May/2011]
The reality is:
- software developers are always describing some aspect of the world (including its information);
- ontologists are ultimately interested in ‘systems’ because that’s where reasoning on ontologies gets done.
The current situation in IT is that the two key activities of the respective groups – ‘information modelling’ and ‘fact description’ – share key essentials, yet hardly intersect in software engineering or philosophy education, textbooks or in practical ways.
Today, however, we live in an age where we want not just ‘information’ and ‘computing’, but intelligence. That means not just gathering data and pushing it into databases and onto the screen, but being able to make inferences from it. Corporations and governments are doing this all the time. It’s how Amazon knows what to recommend for you. Yet achieving this is still too hard, and one of the key blockers is the vast gulf of understanding between the software builders and the ontologists. The symptoms of today’s situation are that:
- the majority of information models are poor, and break basic ontological rules (and even recognised ‘good practice’ rules);
- the majority of ontologies are disconnected from mainstream computing frameworks, and particularly data, and therefore remain in a somewhat academic state.
The interesting thing is that if information modellers were to follow some good ontology practices, they would build better software, even if they a priori don’t care about inferencing or other added value provided by ontologies. In this post I will deal just with the first problem. The challenge of how to get ontologies better connected with data and software frameworks is for another day…
What is an Information Model?
For our purposes here, I will assume the industry-typical object or ‘class’ model, commonly expressed in UML, e.g. as seen here. Such models are supposed to be based on various principles of object-orientation, including abstraction, encapsulation, modularity and so on (see Robert Martin’s site for a pretty good summary). Such models are invariably implemented in some programming language, and ultimately lead to ‘data’ being created, which are technically ‘instances’ of the ‘classes’ found in the model. So for example, in an object-oriented system, there may be a class (or ‘type’) PERSON, from which thousands of PERSON instances are created, each one containing data describing an actual person, e.g. in the Amazon customer database.
For those more familiar with relational concepts, the principles are the same, except that inheritance and encapsulation are weaker semantics; model ‘classes’ are ‘tables’, ‘class properties’ are ‘columns’ and ‘instances’ are ‘rows’.
What is an Ontology?
There are many scholarly answers to this question, e.g. Tom Gruber (Stanford), the Wikipedia entry, and John Sowa’s pages. In Health Informatics, the many publications from Alan Rector’s group at Manchester University (where the OWL language originated) and Barry Smith (University at Buffalo, NY, National Center for Biomedical Ontology, US) are worth reading. A rough summary of ontology in relation to openEHR and EHRs is here. There are a lot of words here, and clicking on such links may take you away for several years! Having come from the software engineering camp myself, I will offer my own definition in the interests of being practically relevant:
- an ontology is a formalised description of some aspect of the world of interest to some community.
In its essence, a formal ontology consists of a set of named ‘types’ or ‘categories’, linked by IS-A relationships, forming a taxonomy. A medical taxonomy might include the categories ‘cancer’ and ‘lymphoma’, with the latter being related to the former by an IS-A relation. More sophisticated ontologies also define ‘properties’ on each category, for example, the category ‘hepatitis’ has a property ‘site’ whose target is ‘liver’, itself a category and subtype of ‘organ’.
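To make the structure concrete, here is a minimal sketch of such a taxonomy in Python. The category and property names come from the examples above, except ‘disease’, which is an assumed common parent added purely for illustration; the dictionaries and the `is_a` helper are hypothetical and do not reflect how any real ontology language or tool stores its data.

```python
from typing import Optional

# Hypothetical mini-taxonomy: each category maps to its direct IS-A parent.
# 'disease' is an assumed root for illustration; it is not from the text above.
IS_A = {
    "lymphoma": "cancer",
    "cancer": "disease",
    "hepatitis": "disease",
    "liver": "organ",
}

# Properties defined on a category: (category, property name) -> target category.
PROPERTIES = {
    ("hepatitis", "site"): "liver",
}


def is_a(category: str, ancestor: str) -> bool:
    """Return True if `category` is `ancestor` or reaches it by following IS-A links."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)  # None once we run out of parents
    return False


# The target of hepatitis.site is 'liver', which is itself a subtype of 'organ'.
site_of_hepatitis = PROPERTIES[("hepatitis", "site")]
```

Here `is_a("lymphoma", "cancer")` holds while `is_a("cancer", "lymphoma")` does not, which is the asymmetry that makes the structure a taxonomy rather than a flat list of terms.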
Is an ontology a ‘model’? Some ontologists would object to this, if they think you mean ‘information model’, but ontologies are of course ‘models of understanding’ or of ‘conceptualisation’, i.e. a written down expression of a constructed idea of X in the real world, as opposed to some direct representation of X, a.k.a. ‘the truth’. Realist philosophers and most non-philosophers understand that the only description we have of any phenomenon is the description coming from our mental conceptualisation of the latter, because ‘description’ is an activity that converts mental models of things to an external representation. Further, the ontologists’ version of doing this is to consciously and formally describe. Therefore, ontologies are ‘models’, in the broader sense of the word.
One of the key concepts everyone thinking about models, whether of the ‘information’ or ‘ontology’ kind, must have clear is the distinction between categories (a.k.a. ‘classes’, ‘types’, ‘kinds’) and individuals (a.k.a. ‘particulars’, ‘instances’). ‘Dog’ is a category; my dog Rex is an instance in the real world of the ontological category ‘Dog’. More subtly, my ‘allergic reaction (of 1 May)’ is not a kind of ‘allergy’ nor an instance of ‘allergy’, but an instance of ‘Substance reaction’ which itself is probably linked to ‘Allergy’ by a property to do with ‘Symptom’.
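For readers from the software side, the category/individual distinction maps roughly onto class versus instance. The following is only a sketch under that assumption: the `Dog` class and its `name` field are illustrative, and a software instance is of course a description of the real Rex, not Rex himself.

```python
class Dog:
    """The category: describes what every dog has, not any particular dog."""

    def __init__(self, name: str) -> None:
        self.name = name


# An individual: one particular instance of the category Dog.
rex = Dog("Rex")

# The category and the individual are different kinds of thing:
# Dog is a type, rex is an instance of that type.
category_is_a_type = isinstance(Dog, type)
rex_is_a_dog = isinstance(rex, Dog)
rex_is_a_type = isinstance(rex, type)
```

The last check is false: an individual is never itself a category, which is the same reason a particular allergic reaction is not a kind of ‘allergy’.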
Bridging the Gap
So far so abstract. Let’s get to the meat of the matter. The 2008 paper “Adapting Clinical Ontologies in Real-World Environments” by Stenzhorn et al describes the basic tenet for ontologies as:
The main construction tenet for ontologies is the taxonomic principle to the effect that a type S is a subtype of another type T if and only if all instances of S are also instances of T…
This is a key principle and applies equally to information models. Most developers of information models would agree with this principle, and yet they routinely produce models that break it! Why is it so? The reason is to do with ‘properties’. In information models we do not define only categories, but we define ‘properties’ on each category – these become the ‘information’ in databases.
The same paper goes on to state the defining property of an ontology thus (my emphasis):
As a fundamental principle, all properties associated with any given type in any ontology must be true for all instances of this particular type. Thus, as an example, all instances of appendectomy are performed on some instance of appendix and all instances of water molecules contain oxygen and hydrogen.
This gets close to the key principle needed to join both camps. However as this paper acknowledges, and any experienced IT person knows, it doesn’t always work in reality. The ideal category ‘hand’ for example in a human anatomy ontology will describe one thumb and 4 fingers as being parts, but of course a person who has lost a finger in an accident still has a ‘hand’. Stenzhorn et al mention ‘canonical ontologies’ as a recent development to get out of this problem. And there is a general problem that ‘wholes’ have to be born or constructed out of bits… when does a pile of components in the Toyota factory become a ‘car’?
From the software engineering and information point of view, I would offer the following modified definition:
As a fundamental principle, all properties defined on any given type in any ontology or information model must have the potential to apply for every instance of this particular type, at some point in its lifetime.
(Note: I had unintentionally retained ‘must be true’ from the original above, but ‘must apply’ is obviously more correct, particularly since many properties in object models are non-boolean [changed 30/May/2011])
This now allows for ‘hands’ without a ‘thumb’ later in life, ‘cars’ whose ‘engines’ have been removed for servicing (but remain ‘cars’ during such operations) and all other contingent divergences from the ideal definition of each category. For lack of an official name for this rule (to my knowledge at least), I will denote it the fundamental ontology property principle (FOPP).
How does this help information modelling?
Amazingly, none of the well-known principles of object modelling directly state this property, although it is implied by some of them, particularly the Liskov Substitution Principle (LSP). And yet it is key to building information models that work.
Let’s consider a practical example. On the category Animal, could we define the property ‘body-plan’, with such possibilities as ‘soft tissue’, ‘chordate’, ‘exoskeleton’ and so on – covering slugs, humans, and beetles for instance? Does this pass the test above? Indeed it does: every ‘Animal’ instance must have a body plan of some kind, even if it has been squashed by a truck and its plan is in slight disarray.
Let’s try another one: wingspan. Could ‘wingspan’ make sense at some point in the lifetime of every instance of Animal in the real world? It certainly can for most insects, even if only after metamorphosis; it does for birds and for mammals such as bats, and numerous types of reptile (in the past at least) could fly. However there are many Animal types for which ‘wingspan’ can have no meaning, ever in their lifetime, including all simple organisms, soft-tissued animals, and the majority of chordates to name some. Wingspan only has a meaning for some body plans: specific exoskeleton and specific chordate plans to be precise. So wingspan should never appear on the class Animal. If it does, it becomes a useless data item, always Void, for most instances in any information processing system dealing with Animal instances.
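The two cases can be sketched in code as follows. This is an illustration only: `BodyPlan`, `WingedAnimal` and the field names are hypothetical, the wingspan figure is made up, and a real zoological model would not group winged animals under a single class like this. The point is simply where each property is declared.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BodyPlan(Enum):
    SOFT_TISSUE = "soft tissue"
    CHORDATE = "chordate"
    EXOSKELETON = "exoskeleton"


@dataclass
class Animal:
    name: str
    # Applies to every Animal instance, so it belongs on the base class.
    body_plan: BodyPlan


@dataclass
class WingedAnimal(Animal):
    # Hypothetical subtype: wingspan can apply to every instance of it at some
    # point in its lifetime, but may be unset early on (e.g. before metamorphosis).
    wingspan_cm: Optional[float] = None


slug = Animal("slug", BodyPlan.SOFT_TISSUE)
bat = WingedAnimal("bat", BodyPlan.CHORDATE, wingspan_cm=30.0)  # illustrative value
caterpillar = WingedAnimal("caterpillar", BodyPlan.EXOSKELETON)  # wings come later
```

The slug has no `wingspan_cm` attribute at all, rather than one that is permanently Void, while the caterpillar has one that is Void only for now.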
Many information modellers would have no trouble distinguishing the two cases above, because they can reason about the categories by referring to the real world, which contains not only real instances but a lot of nice books and television programs that educate us on such facts.
Unfortunately, information models are much more often about abstract concepts, or to be correct, types which have no direct referent in the real world, or clash with mathematical concepts (see this surprisingly good C++ FAQ page for a discussion of the famous Circle versus Ellipse modelling problem). For example, in health informatics there have been various efforts to define a set of ‘data types for health information’. The categories in such a model might include those shown below:
In this diagram, the two types DATA_VALUE and TIME are ‘abstract’, meaning they have no direct instances. The basic idea in such information models is to define properties on each type in such a way that more specific types (i.e. lower down the ‘inheritance’ or ‘is-a’ hierarchy) have properties specific to those types.
The challenge for many modellers is to follow the fundamental ontology property principle. One of the common failures is to put properties too high in the hierarchy, i.e. to make the ‘wingspan error’.
The effects of this can be truly problematic. In realistic models with numerous properties, if say only 5 to 10 properties are wrongly defined in the class DATA_VALUE, along with their attendant methods (i.e. routines), and the same practice occurs in other classes, then concrete classes deep in the hierarchy, such as CODED_TEXT in the example here, end up with numerous useless properties and routines. Consequences include:
- the classes are very hard to reason about and therefore program, meaning they are error-prone and may cause serious bugs in behaviour and data;
- many instances in the data may contain numerous Void fields which have to be dealt with in some way, and which unnecessarily complicate databases;
- it breaks the extensible nature of normal object models, which requires properties to be added going down the inheritance hierarchy, and to the most specific class for which it makes sense;
- it makes models and software brittle, since applying this practice in the extreme requires all possible properties of all descendant types to be included in the base class. This can never be successfully done, since no one can predict all future subtypes needed in a model.
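The consequences listed above can be seen in a small sketch. Only the type names DATA_VALUE and CODED_TEXT come from the discussion; the `Quantity` class, the properties (`code`, `magnitude`, `units`) and the `void_field_count` helper are assumptions invented for illustration, not taken from any published data types specification.

```python
from dataclasses import dataclass, fields
from typing import Optional


# The 'wingspan error': subtype-specific properties hoisted into the base class.
@dataclass
class BloatedDataValue:
    code: Optional[str] = None          # only meaningful for coded text
    magnitude: Optional[float] = None   # only meaningful for quantities
    units: Optional[str] = None         # only meaningful for quantities


@dataclass
class BloatedCodedText(BloatedDataValue):
    pass


# Alternative that respects the principle: each property sits on the most
# specific class for which it can apply to every instance.
@dataclass
class DataValue:
    """Treated as abstract here: not meant to be instantiated directly."""


@dataclass
class CodedText(DataValue):
    code: str


@dataclass
class Quantity(DataValue):
    magnitude: float
    units: str


def void_field_count(obj) -> int:
    """Count the dataclass fields of `obj` whose value is None (i.e. Void)."""
    return sum(1 for f in fields(obj) if getattr(obj, f.name) is None)


bloated = BloatedCodedText(code="example-code")
clean = CodedText(code="example-code")
```

With only three misplaced properties, every bloated coded-text instance already carries two Void fields (`magnitude` and `units`), while the clean one carries none; the gap widens with every property added to the base class.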
Since, in the real world, the information and software class models used in a given system are rarely all built by the system developer, violating the principle has another important consequence: it can cripple standards usability. Specifically, it prevents developers of lower-down classes from easily using and building upon specifications of higher-up classes, e.g. as issued by standards bodies. If such bodies violate the FOPP, they create huge downstream problems for developers – who have no freedom to correct the problems. They also make their own lives harder, since it is usually a bigger job to get agreement on a class with more properties rather than fewer.
Violation of the FOPP is the key reason for many of my objections to the HL7 / ISO 21090 data types, mentioned in previous posts, and also the HL7 RIM. One of the reasons for this violation is that the developers of a particular model create it for their own specific purposes and scope, and it may initially not violate the FOPP in that scope. However, if it is then proposed for wider use, where the possible instances of each category are vastly more numerous and more diverse, many of the assumptions built into the original model fall apart.
My challenge to both ontologists and IT educators: get the FOPP into the minds of practitioners in both camps and onto the first pages of all relevant textbooks.