Something called the “Yosemite manifesto on RDF as a Universal Healthcare Exchange Language” was published in 2013 as the Group position statement of the Workshop on RDF as a Universal Healthcare Exchange Language held at the 2013 Semantic Technology and Business Conference, San Francisco. Can such grand claims be true?
I’m not sure whether either the slide above or the original reference is reliably available at the moment, so I’ll reproduce the text here:
- RDF is the best available candidate for a universal healthcare exchange language.
- Electronic healthcare information should be exchanged in a format that either: (a) is an RDF format directly; or (b) has a standard mapping to RDF.
- Existing standard healthcare vocabularies, data models and exchange languages should be leveraged by defining standard mappings to RDF, and any new standards should have RDF representations.
- Government agencies should mandate or incentivize the use of RDF as a universal healthcare exchange language.
- Exchanged healthcare information should be self-describing, using Linked Data principles, so that each concept URI is de-referenceable to its free and open definition.
I’m sure the signatories’ hearts are in the right place, but unfortunately the universal claims made here don’t stand up to scrutiny.
The basic claim is that we should all be using RDF for healthcare data exchange, or else a format that can map to it. I’ll quote the justification for this (my bolding):
Serving as moderator of the discussion, David Booth presented a number of compelling arguments in favor of RDF. First, RDF takes syntactic and formatting issues off the table so that the people using the data can focus on core semantic issues. Second, RDF is schema promiscuous, meaning that you can create multiple models of the same data using RDF and those models can co-exist peacefully without wreaking havoc on the whole system. Third, RDF is a neutral and mature international standard governed by the W3C. It has the support and depth necessary to handle such a monumental task.
The panel was quick to note that RDF is not a perfect solution, and there will certainly be a great deal of difficulty in uniting the private healthcare industry under the banner of RDF–government mandates and incentives will almost certainly be a necessary part of this process–but more than any other tool, RDF has the potential to simplify and standardize healthcare data in a way that will make it exponentially more searchable, thereby making healthcare more affordable and more effective.
Could any of this be true? Let’s remind ourselves what RDF is: it is a conceptual modelling language, whose statements are in the form of subject-predicate-object triples. It doesn’t have typing, since it is designed for representing conceptual relationships. It has an associated query language called SPARQL.
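To make the triple idea concrete, here is a minimal sketch in plain Python rather than a real RDF library. The identifiers (`patient:123`, `has_diagnosis` and so on) and the `match` helper are invented for illustration; `match` only loosely imitates the kind of pattern matching a SPARQL query performs, with `None` acting as a wildcard.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers, made up for this illustration.
TRIPLES: List[Triple] = [
    ("patient:123", "has_diagnosis", "condition:asthma"),
    ("patient:123", "has_name", "Jane Doe"),
    ("patient:456", "has_diagnosis", "condition:diabetes"),
    ("condition:asthma", "is_a", "condition:respiratory_disorder"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None matches anything."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Find every diagnosis statement", whatever the subject and object are.
diagnoses = match(TRIPLES, p="has_diagnosis")
print(diagnoses)
```

Everything, including the patient's name and the relationship between two conditions, is expressed as the same kind of statement.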
Now let’s consider the problem of health data exchange. In general this means:
- Data travelling between systems, e.g. lab results system <=> hospital EMR system; insurance claims; statistical reporting to central government registers
- Data travelling between back-end service and apps, e.g. patient nursing summary screen in hospital talking to EMR system
- Extract / Transform / Load data extraction to generate study databases for secondary use
Nearly all of these systems today:
- are built using some combination of object-oriented programming, and relational or other kinds of database technology. This means they have their own models, which are typed.
- potentially process huge amounts of data, which means they use efficient formats for storage and transfer.
There is no end of interoperability problems in today’s health data systems (which is why I have worked on interoperable health solutions for many years). The above manifesto claims to solve the problems of differing physical representations (apparently RDF makes these magically disappear) while implying that semantic differences would be erased simply by expressing current data (models presumably) in RDF, and at the same time, querying would be ‘exponentially’ more effective.
If only it were so easy.
The problem of physical representation
Unfortunately, the manifesto authors seem to have misunderstood why there are diverse concrete formats in the world. Concrete formats are chosen depending on the task at hand. If representing petabytes of health data is the task, you need a space-efficient format (yes, space does cost money: just get a quote for 10TB of high-availability storage from a cloud provider, and then one for 100TB, and see if they are the same) – invariably binary. If the job is moving small amounts of data between a back-end and diverse mobile apps, you need a light-weight web-friendly format – often JSON these days. If you are moving data between large systems that assume very standardised messages, you can use a format that strips the message schema information and just sends values; if not, the format needs to include some level of schema information. If privacy is an issue, then formats that support encryption and/or obfuscation are needed.
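A rough sketch of the trade-off between a self-describing format and a values-only one, using only the Python standard library. The lab-result record, its field names and the binary layout are all assumptions made up for the example, not taken from any real messaging standard.

```python
import json
import struct

# Hypothetical lab result; field names and values are invented.
record = {
    "patient_id": 123456,
    "test_code": 718,
    "value": 13.5,
    "timestamp": 1700000000,
}

# Self-describing: every message carries its own field names.
as_json = json.dumps(record).encode("utf-8")

# Values only: both systems must already agree on this fixed layout
# (little-endian uint32, uint16, float64, uint32).
LAYOUT = struct.Struct("<IHdI")
as_binary = LAYOUT.pack(
    record["patient_id"],
    record["test_code"],
    record["value"],
    record["timestamp"],
)

print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as packed binary")

# The receiver can only decode the binary form if it knows LAYOUT.
print(LAYOUT.unpack(as_binary))
```

The packed form is much smaller, but it is meaningless without the out-of-band agreement on the layout, which is exactly why different tasks end up with different formats.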
The diversity is endless. Turning everything into XML, RDF or some other silver bullet stops vendors actually achieving basic performance, volumetric and security aims. That’s why they don’t do it.
Data are based on typed, structured models
The other reason you can’t easily use RDF for most production data is that RDF is not typed – it only represents the conceptual relationships between entities. Most health data (like the data in other industries) is defined by typed, structural models.
Now, you can use RDF in its OWL form to do some of the job of UML, in a clunky sort of way, so that typing is available. But realistically, you might as well just use UML or any modern object-oriented and/or functional programming language. Better tools and a more advanced version of OWL may one day change this. But the point is, now we are talking about model representation, not data exchange.
The problem of semantics
The main problem with the claims here is that they gloss over the semantic differences of diverse healthcare data. Data from different systems and standards are built on different models. Sometimes these models are radically different. There is no quick way to make such differences go away. Converting the models to an RDF representation, even assuming that were possible, doesn’t make the differences go away. It might expose some of them better than the physical representations do, so a limited claim of the utility of RDF to help solve the semantic gap problem might make sense.
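As a small illustration of that point, the sketch below takes two invented systems that record the same blood pressure reading in different ways and flattens both into triples. The record shapes, field names and the `to_triples` helper are hypothetical; the conversion is deliberately naive.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical system A: one observation holding both components, in mmHg.
system_a = {"id": "obs1", "type": "blood_pressure", "systolic": 120, "diastolic": 80}

# Hypothetical system B: two separate measurements in kPa, linked by encounter.
system_b = [
    {"id": "m1", "kind": "sys_bp", "value_kpa": 16.0, "encounter": "e9"},
    {"id": "m2", "kind": "dia_bp", "value_kpa": 10.7, "encounter": "e9"},
]


def to_triples(record: Dict[str, object], id_key: str = "id") -> Set[Triple]:
    """Naively turn each field of a record into a (subject, predicate, object) triple."""
    subject = str(record[id_key])
    return {(subject, key, str(value)) for key, value in record.items() if key != id_key}


triples_a = to_triples(system_a)
triples_b: Set[Triple] = set()
for rec in system_b:
    triples_b |= to_triples(rec)

predicates_a = {p for _, p, _ in triples_a}
predicates_b = {p for _, p, _ in triples_b}

# Both data sets are now "in triples", yet they share no vocabulary at all.
print(predicates_a & predicates_b)
```

Both data sets end up in the same syntax, but the structural and unit differences are all still there. Someone still has to work out that `systolic` in one corresponds to a `sys_bp` measurement in kPa in the other.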
More searchable data
The above quote included the claim that health data in RDF would be ‘exponentially more searchable’, leading to better healthcare. What I suspect the authors are alluding to here is the idea that if only health data could be connected up, then querying would be able to find more facts, and draw more conclusions. In theory that’s likely to be true. To achieve this you have to solve a) basic interoperability (just sharing the data) and b) the common semantic problem. Converting everything to RDF doesn’t make any of this go away.
So in the end, the universal claims in the manifesto, particularly points 1., 2. and 5., are unsupported by the evidence and normal practices of software engineering and data processing.
Where RDF has value
Is RDF useful for anything in health? Certainly. It’s used to represent semantic models of various kinds, particularly ontologies, with projects to convert terminologies like SNOMED CT underway. Its utility here is that one can make assertions (e.g. in OWL) that provers can machine-process, to a) validate statements and b) generate inferences. For example, an EMR system may contain health records of patients from which the provider wants to find asthma sufferers. This is achieved by querying for e.g. asthma diagnoses, asthma medications, and asthma symptoms. Some of these will be coded, using codes from a terminology such as SNOMED CT. The various codes can be compared to the terminology to infer if the patient actually suffers from asthma. An RDF or RDF-like representation of the terminology enables semantic relationships like ‘is-a’ and ‘has-site’ to be easily represented and traversed.
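The asthma example can be sketched as a traversal of ‘is-a’ relationships. The concept names below are invented placeholders, not real SNOMED CT codes, and the tiny hierarchy and patient records are assumptions for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical 'is-a' hierarchy: child concept -> parent concepts.
IS_A: Dict[str, Set[str]] = {
    "allergic_asthma": {"asthma"},
    "exercise_induced_asthma": {"asthma"},
    "asthma": {"respiratory_disorder"},
    "type_2_diabetes": {"diabetes"},
}


def ancestors(concept: str, is_a: Dict[str, Set[str]]) -> Set[str]:
    """Collect every concept reachable by following 'is-a' links upwards."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_kind_of(concept: str, target: str, is_a: Dict[str, Set[str]]) -> bool:
    """True if concept is the target itself or one of its descendants."""
    return concept == target or target in ancestors(concept, is_a)


# Hypothetical coded diagnoses per patient.
RECORDS: Dict[str, List[str]] = {
    "patient:1": ["allergic_asthma"],
    "patient:2": ["type_2_diabetes"],
    "patient:3": ["asthma", "type_2_diabetes"],
}

asthma_patients = sorted(
    patient
    for patient, codes in RECORDS.items()
    if any(is_kind_of(code, "asthma", IS_A) for code in codes)
)
print(asthma_patients)  # ['patient:1', 'patient:3']
```

The first patient is found even though the record never contains the code for asthma itself, only for a more specific kind of it.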
Another possible use might one day be for representing content models, i.e. what we currently call Detailed Clinical Models (DCMs), archetypes, and so on. In my work, we use a language called Archetype Definition Language to do this. This is a constraint language that defines models in terms of constraints on an underlying information model. Various attempts have been made to replicate such models in OWL, over the period of a decade. Other approaches use XML Schema, UML and proprietary constraint languages. None that I am aware of uses RDF or OWL, because various namespacing and other issues have never really been resolved. Even if they were, I remain sceptical, since the real problem (in my view) is that archetypes and other DCM formalisms are based on frame logic, whereas RDF and OWL are designed around Description Logics.
Point 3. of the manifesto is a blanket injunction to use RDF to leverage the power of existing standards, models and vocabularies. This is already being done to some extent, but the reality is that real-world representation problems are hard, and a decade of trying to apply OWL and RDF hasn’t produced any extraordinary breakthrough. Maybe such a breakthrough is possible, but in that case, some real evidence would be needed to back up the blanket claims of this manifesto, which ultimately aim to influence government funding.
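For readers who haven’t met the idea of constraining an information model, here is a loose Python analogy. It is not Archetype Definition Language syntax; the `Quantity` class, the constraint table and the value ranges are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Quantity:
    """A generic class from a hypothetical underlying information model."""

    magnitude: float
    units: str


# A hypothetical content model for systolic blood pressure: it adds no new
# classes, it only narrows what a Quantity instance may legally contain.
SYSTOLIC_CONSTRAINTS: Dict[str, Callable[[object], bool]] = {
    "magnitude": lambda m: 0 <= m < 1000,
    "units": lambda u: u == "mm[Hg]",
}


def conforms(instance: object, constraints: Dict[str, Callable[[object], bool]]) -> bool:
    """Check every constrained attribute of the instance."""
    return all(check(getattr(instance, attr)) for attr, check in constraints.items())


print(conforms(Quantity(120.0, "mm[Hg]"), SYSTOLIC_CONSTRAINTS))  # True
print(conforms(Quantity(120.0, "kPa"), SYSTOLIC_CONSTRAINTS))  # False
```

The information model stays generic and typed, while the clinical meaning lives in the constraints layered on top of it.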
I appreciate that RDF and people using it have things to offer. However, the claim that they have the one format that will solve everything is not at all helpful to their cause, or anyone else’s. What we need is an idea of where and how RDF (and OWL) can be applied to the numerous specific and difficult problems in e-health, and just as importantly, where it is not appropriate.