RDF for universal health data exchange? Correcting some basic misconceptions…

Something called the “Yosemite manifesto on RDF as a Universal Healthcare Exchange Language” was published in 2013 as the Group position statement of the Workshop on RDF as a Universal Healthcare Exchange Language held at the 2013 Semantic Technology and Business Conference, San Francisco. Can such grand claims be true?

yosemite_manifesto

 

I’m not sure if either the slide above or the original reference are reliable at the moment, so I’ll reproduce the text here:

  1. RDF is the best available candidate for a universal healthcare exchange language.
  2. Electronic healthcare information should be exchanged in a format that either: (a) is an RDF format directly; or (b) has a standard mapping to RDF.
  3. Existing standard healthcare vocabularies, data models and exchange languages should be leveraged by defining standard mappings to RDF, and any new standards should have RDF representations.
  4. Government agencies should mandate or incentivize the use of RDF as a universal healthcare exchange language.
  5. Exchanged healthcare information should be self-describing, using Linked Data principles, so that each concept URI is de-referenceable to its free and open definition.

I’m sure the signatories’ hearts are in the right place, but unfortunately the universal claims made here don’t stand up to scrutiny.

The basic claim is that we should all be using RDF for healthcare data exchange, or else a format that can map to it. I’ll quote the justification for this (my bolding):

Serving as moderator of the discussion, David Booth presented a number of compelling arguments in favor of RDF. First, RDF takes syntactic and formatting issues off the table so that the people using the data can focus on core semantic issues. Second, RDF is schema promiscuous, meaning that you can create multiple models of the same data using RDF and those models can co-exist peacefully without wreaking havoc on the whole system. Third, RDF is a neutral and mature international standard governed by the W3C. It has the support and depth necessary to handle such a monumental task.

The panel was quick to note that RDF is not a perfect solution, and there will certainly be a great deal of difficulty in uniting the private healthcare industry under the banner of RDF–government mandates and incentives will almost certainly be a necessary part of this process–but more than any other tool, RDF has the potential to simplify and standardize healthcare data in a way that will make it exponentially more searchable, thereby making healthcare more affordable and more effective.

Could any of this be true? Let’s remind ourselves what RDF is: it is a conceptual modelling language, whose statements are in the form of subject-predicate-object triples. It doesn’t have typing, since it is designed for representing conceptual relationships. It has an associated query language called SPARQL.

Now let’s consider the problem of health data exchange. In general this means:

  • Data travelling between systems, e.g. lab results system <=> hospital EMR system; insurance claims; statistical reporting to central government registers
  • Data travelling between back-end service and apps, e.g. patient nursing summary screen in hospital talking to EMR system
  • Extract / Transform / Load data extraction to generate study databases for secondary use

Nearly all of these systems today:

  • are built using some combination of object-oriented programming, and relational or other kinds of database technology. This means they have their own models, which are typed.
  • potentially process huge amounts of data, which means they use efficient representations for representation and transfer.

There is no end of interoperability problems of today’s health data systems (which is why I have worked on interoperable health solutions for many years). The above manifesto claims to solve the problems of differing physical representations (RDF makes this magically disappear apparently) while implying that semantic differences would be erased simply by expressing current data (models presumably) in RDF, and at the same time, querying would be ‘exponentially’ more effective.

If only it were so easy.

The problem of physical representation

Unfortunately, the manifesto authors seem to have misunderstood why there are diverse concrete formats in the world. Concrete formats are chosen depending on the task at hand. If representing peta-bytes of health data is the task, you need a space-efficient format (yes space does cost money, just get a quote for 10TB of high-availability storage from a cloud provider, and then one for 100TB, and see if they are the same) – invariably binary. If the job is moving small amounts of data between a back-end and diverse mobile apps, you need a light-weight web-friendly format – often JSON these days. If you are moving data between large systems that assume very standardised messages, you can use a format that strips the message schema information and just sends values; if not, the format needs to include some level of schema information. If privacy is an issue, then formats that support encryption and/or obfuscation are needed.

The diversity is endless. Turning everything into XML, RDF or some other silver bullet stops vendors actually achieving basic performance, volumetric and security aims. That’s why they don’t do it.

Data are based on typed, structured models

The other reason you can’t easily use RDF for most production data is that RDF is not typed – it only represents the conceptual relationships between entities. Most health data (like the data in other industries) is defined by typed, structural models.

Now, you can use RDF in its OWL form to do some of the job of UML, in a clunky sort of way, so that typing is available. But realistically, you might as well just use UML or any modern object-oriented and/or functional programming language. Better tools and a more advanced version of OWL may one day change this. But the point is, now we are talking about model representation, not data exchange.

The problem of semantics

The main problem with the claims here are that they gloss over the semantic differences of diverse healthcare data. Data from different systems and standards are built on different models. Sometimes these models are radically different. There is no quick way to make such differences go away. Converting the models to an RDF representation, even assuming that were possible, doesn’t make the differences go away. It might expose some of them better than the physical representations do, so a limited claim of the utility of RDF to help solve the semantic gap problem might make sense.

More searchable data

The above quote included the claim that health data in RDF would be ‘exponentially more searchable’, leading to better healthcare. What I suspect the authors are alluding to here is the idea that if only health data could be connected up, then querying would be able to find more facts, and make more conclusions. In theory that’s likely to be true. To achieve this you have to solve a) basic interoperabilty (just sharing the data) and b) the common semantic problem. Converting everything to RDF doesn’t make any of this go away.

So in end, the universal claims in the manifesto, particularly points 1., 2. and 5. are unsupported by the evidence and normal practices of software engineering and data processing.

Where RDF has value

Is RDF useful for anything in health? Certainly. It’s used to represent semantic models of various kinds, particularly ontologies, with projects to convert terminologies like SNOMED CT underway. Its utility here is that one can make assertions (e.g. in OWL) that provers can machine-process, to a) validate statements and b) generate inferences. For example, an EMR system may contain health records of patients from which the provider wants to find asthma sufferers. This is achieved by querying for e.g. asthma diagnoses, asthma medications, and asthma symptoms. Some of these will be coded, using codes from a terminology such as SNOMED CT. The various codes can be compared to the terminology to infer if the patient actually suffers from asthma. An RDF or RDF-like representation of the terminology enables semantic relationships like ‘is-a’ and ‘has-site’ to be easily represented and traversed.

Another possible use might one day be for representing content models, i.e. what we currently call Detailed Clinical Models (DCMs), archetypes, and so on. In my work, we use a language called Archetype Definition Language to do this. This is a constraint language that defines models in terms of constraints on an underlying information model. Various attempts have been made to replicate such models in OWL, over the period of a decade. Other approaches use XML-schema, UML and proprietary constraint languages. None that I am aware of uses RDF or OWL, because various namespacing and other issues have never really been resolved. Even if they were, I remain sceptical, since the real problem (in my view) is that archetypes and other DCM formalisms are Frame logic based, whereas RDF and OWL are designed for use as Description Logics.

Point 3. of the manifesto is a blanket injunction to use RDF to leverage the power of existing standards, models and vocabularies. This is already being done to some extent, but the reality is that real world representation problems are hard, and a decade of trying to apply OWL and RDF hasn’t produced any extraordinary breakthrough. Maybe such a breakthrough is possible, but in that case, some real evidence would be needed to back up the blanket claims of this manifesto, which ultimately aim to influence government funding.

In conclusion….

I appreciate that RDF and people using it have things to offer. However, the claim that they have the one format that will solve everything is not at all helpful to their cause, or anyone else’s. What we need is an idea of where and how RDF (and OWL) can be applied to the numerous specific and difficult problems in e-health, and just as importantly, where it is not appropriate.

Advertisements

About wolandscat

I currently work in e-health, and am senior architect of the openEHR.org specifications, designed for semantic interoperability of health information. I also designed the Archetype formalism and model used in openEHR. Outside of work, I am interested in guitar, travel, and philosophy.
This entry was posted in Health Informatics and tagged , , , , , , . Bookmark the permalink.

6 Responses to RDF for universal health data exchange? Correcting some basic misconceptions…

  1. David Booth says:

    Hi Thomas,

    Thanks for your critique of the Yosemite Manifesto. I appreciate your taking the time to look at this and think about these issues. However, some of the conclusions that you state seem to be based on an incomplete understanding of RDF and the intent of the Yosemite Manifesto, so I will try to clarify.

    The Yosemite Manifesto was intentionally very brief. The downside of that brevity is that a lot of things were left unsaid or unexplained. I won’t be able to explain them all here, but I would encourage you to read in more detail the materials from the workshop, which are available at http://dbooth.org/2013/semtech/ . We will also be holding a second annual workshop on this topic later this month at the 2014 Semantic Technology and Business Conference, and I would encourage you to come if you can: http://semtechbizsj2014.semanticweb.com/sessionPop.cfm?confid=82&proposalid=6666 .

    Addressing some specific statements in your post:

    “[RDF] is a conceptual modelling language”. Yes, but that’s not all it is. It is also a language for representing information content in a format-independent way. RDF captures information content, independent of data format. The exact same RDF content can be serialized in many different formats, such as JSON or XML. Or to put it the other way around, information content from any data format can be viewed as RDF, given a mapping. RDF is about the *information* content — not the data format.

    “[RDF] doesn’t have typing”. RDF *does* have typed literals. Usually established XML Data Types are used, but arbitrary other datatypes can be used also.

    “Nearly all of these systems today . . . have their own models . . . [and] use efficient representations for representation and transfer.” Yes, and the Yosemite Manifesto does *not* advocate using RDF **instead of** those representations. If two systems both understand the same efficient representation, then they should communicate using that representation. But sometimes they don’t, and this is why the Yosemite Manifesto recommends that a standard mapping be defined from each information representation language to RDF, as a common semantic representation. If there is a need to translate from one representation to another, the semantics of the data can be universally understood as the RDF that corresponds to that data, and translations can be uniformly defined in terms of that RDF.

    “The above manifesto claims to solve the problems of differing physical representations (RDF makes this magically disappear apparently)”. No, there’s no magic with RDF. RDF allows the problem of differing physical representations to be solved by decomposing it into two parts: (a) superficial *syntactic* differences between physical representations; and (b) deeper *semantic* (or structural) differences between those representations. For example, imagine representing the same information in both XML and JSON. If that information is represented using the same terms and the same structure in both XML and JSON, then the difference is merely a superficial syntactic difference. On the other hand, if that same conceptual information is represented using completely different data models and vocabularies, then there would also be a deeper semantic difference. RDF does not address the syntactic differences, but it does address the deeper semantic (or structural) differences by enabling relationships between differing data models and vocabularies to be captured explicitly (as RDF) and by enabling inferencing to be used to bridge between differing data models and vocabularies. This decomposition does *not* magically solve the problem, but it does make the problem *easier* to solve, both because you don’t have to deal with the synactic and semantic differences at the same time, and because it enables RDF to be used as a common intermediate language that captures the information content independent of data format.

    “Turning everything into XML, RDF or some other silver bullet stops vendors actually achieving basic performance, volumetric and security aims”. Agreed. But as explained above, the Yosemite Manifesto does *not* advocate turning everything into RDF. What it advocates is more nuanced. It *does* advocate that mappings be defined between all the various information representations and RDF, in order to have a standard way to interpret the meaning of data, and to enable that data to be translated into some other representation when necessary. We do want to be *able* to interpret any data as RDF (via its RDF mapping), but we should only actually transform it to RDF when necessary. As you point out, it would very inefficient to do that transformation all the time.

    “The other reason you can’t easily use RDF for most production data is that . . . . Most health data . . . is defined by typed, structural models.” RDF certainly can do typed, structural models. We do it all the time in RDF — hierarchical structures, relational structures and more. But RDF is different than data representations that impose a rigid schema. RDF is **multi-schema friendly**: multiple data models can peacefully coexist within the same RDF data, semantically interconnected. This is very helpful when integrating data from multiple data models or translating between data models. See the illustrations and explanation in “Key Things You Need to Know About RDF (And Why They Are Important)”:
    http://dbooth.org/2014/key/key-slides.pdf

    “The main problem with the claims here are that they gloss over the semantic differences of diverse healthcare data.” That is true, but it’s only because we wanted to keep the Yosemite Manifesto very short — not because we are unaware of those semantic differences. In fact, your further comments are spot on: “Data from different systems and standards are built on different models. Sometimes these models are radically different. There is no quick way to make such differences go away. Converting the models to an RDF representation, even assuming that were possible, doesn’t make the differences go away. It might expose some of them better than the physical representations do, so a limited claim of the utility of RDF to help solve the semantic gap problem might make sense.” That is *exactly* one of the values that RDF provides. You are absolutely correct that the need for semantic alignment is still there. RDF helps address that need in three main ways: (a) by putting the focus on the semantic alignment — which as you correctly point out is the real hard part — and decoupling the semantic alignment problem from the trivialities of different syntaxes; (c) by providing a common, open-standards-based semantic representation for any structured information; and (d) by supporting inference, which can simplify the task of semantic alignment (such as inferring that a v:MitralValve is also a v:HeartValve). You are absolutely right that RDF doesn’t make the semantic alignment problem disappear, but it does make it easier to solve.

    “The [article quoted] included the claim that health data in RDF would be ‘exponentially more searchable’”. That quote was written by a reporter who covered the event, and it must have been taken out of context, because it sounds rather far overblown as a general statement. I don’t know what the original speaker might have been discussing, but it was probably one particular success story in which such a dramatic improvement was observed *after* the hard word of semantic interconnection had been done. Your speculation is correct: “What I suspect the authors are alluding to here is the idea that if only health data could be connected up, then querying would be able to find more facts, and make more conclusions.” Indeed, one key motivation for using RDF is precisely to enable health data to be connected more easily.

    “[A] decade of trying to apply OWL and RDF hasn’t produced any extraordinary breakthrough”. I would partially agree with this, in the sense that the successes have been more gradual than sudden and explosive, and it has taken a number of years for RDF tooling to reach the maturity needed for commercial projects. But the successes are steadily accumulating, with no signs of ending. The development of ICD-11 is one good example. See “Using Semantic Web in ICD-11: Three Years Down the Road”:
    http://web.stanford.edu/~natalya/papers/iswc2013_ICD11.pdf
    There are many other successes as well, often more internal than publicly visible — just quietly getting the job done.

    Your post correctly identifies some of the value of RDF and OWL, but there are two other important benefits that should be mentioned. (You may have alluded to them, but I think they should be stated explicitly.) One is RDF’s use as an open, information integration medium. This is one of the main ways that RDF is used commercially, when an organization needs to integrate many or changing data models or vocabularies. The other is the ability to act as a common semantic foundation that can span across different information representation standards, data models and vocabularies.

    Healthcare has so many specialties and diverse use cases for data — clinical, quality measurement, administrative, research, etc. These use cases demand a rich variety of data models and vocabularies that each serve a particular purpose. We cannot expect that variety to disappear, but we can at least have a common way to express the semantics of, and relate, these different data models and vocabularies. That’s what RDF offers — in a mature, open, vendor-neutral, non-proprietary international standard.

    I hope this clarifies a few points. Thanks again for your critique of the Yosemite Manifesto. And if you can make it to this year’s workshop on “RDF as a Universal Healthcare Exchange Language” at the Semantic Technology and Business Conference in San Jose, it would be great to have your participation!
    http://semtechbizsj2014.semanticweb.com/sessionPop.cfm?confid=82&proposalid=6666

  2. wolandscat says:

    David,
    thanks for your reply. I will certainly follow up the links you have provided. However, I still do not see the justification for the original claim that RDF should be used as a universal health data exchange language. The need to use efficient engineering solutions for exchanging data in high-availability and large-data systems doesn’t go away, and RDF won’t help here in any way that I can see.

    Instead, I think that the main claim to utility for RDF in this space is in semantic model comparison, mapping and eventual remediation / consolidation. The result of that might be that data from different sources / standards (still mostly exchanged by efficient means) would be semantically closer to each other.

    The way we solve this problem in openEHR doesn’t get too concerned with physical formats; it works by a) defining a very solid, stable concrete information model, and then b) using a standard way of defining models of domain semantics (outside the information models and software entirely – see http://www.openehr.org/ckm) and then marking the data with the ids from those models.

    Many years ago (2004 I think) we thought OWL might be a better way to do this, and quite a lot of work was done, including by people from U Manchester. The problems we discovered with OWL (and I think any RDF-based representation) are that a) they don’t easily express the kinds of data constraints we needed (this in principle is easily fixable) and b) it was very difficult to enable types from an information model (data types, types like ‘Observation’, ‘Entry’ and so on) to co-exist with ‘types’ from the semantic model layer (e.g. ‘blood pressure observation’, ‘lipids result’ etc).

    None of the efforts since have solved this problem. Any RDF approach I know of treats things like ‘blood pressure observation’ as their own type. This prevents the building of efficient information processing software that can use the assumption that (to give on example) the timing information of any clinical information that is a time-series (BP, heart rate, breathing, …) is the same underlying formal definition. This is what a ‘reference model’ gives us.

    There may be ways to get around this that I am not aware of.

    In any case, I think RDF’s utility is very unlikely to be in the data exchange area, and much more likely to be in the model analysis area. However, even there, what is needed in industry is a proper connection between RDF and the typical UML, Ecore and OOPL tools in use in the mainstream (and I am no defender of UML…).

    In the end, I think claiming that RDF can univerally solve health data exchange obscures what RDF-based methods actually might solve.

    • Andy J says:

      “I still do not see the justification for the original claim that RDF should be used as a universal health data exchange language. The need to use efficient engineering solutions for exchanging data in high-availability and large-data systems doesn’t go away, and RDF won’t help here in any way that I can see.”

      David touched on this in his reply but perhaps it could be re-stated more explicitly: RDF is first and foremost an information representation model. That model is implemented in many different serialisation formats, each being more suitable to particular purposes. Perhaps the most well known is the RDF-XML format, and unfortunately the term “RDF” is often used to refer to this one specific serialisation format (with all the connotations of XML). Perhaps this is why you associate RDF with inefficiency, but there are many other serialisation formats including JSON and binary.

      The key point about these serialisation formats is that they all share the same RDF representation model, and thus they are completely interoperable despite their syntactic variability. Consider an example where you have some data stored in a relational database model and you need to encode it in JSON to serve to a client application. You as an application developer have to make decisions about how to convert a table-based representation into JSON, because neither tables nor JSON say anything about how the information is structured. How you will do this will depend on the data and indeed on the use case. What RDF does is standardise that structure. Technically you could achieve the same thing by getting everyone to agree to represent data as tables and agree on ways to encode tables in JSON (for example).

      Of course what makes RDF more useful than this is that it also helps (but does not solve) at the semantic layer. Most importantly, it forces everything about the data to be made explicit – moving such things out of separate schema definitions, UML models, PDF documents and application logic and into the data itself.

      Of course, any time you standardise something you give something up. Even binary RDF formats are not going to be efficient as an application-specific format written by a developer who knows that it only needs to store 3 of the 10 fields for an object. I expect this is why the authors of the manifesto are proposing RDF not as the only, or even the primary, representation of data in healthcare, but as an interchange format. Making it easier (note: not automatic) for applications to exchange data. If there are standard RDF formats for particular types of data (like EHRs) that define both the representation model (RDF) and the schema itself that would of course be ideal, but those two things can be done separately and even without that semantic unity, applications would still benefit immensely.

      • David Booth says:

        Excellent explanation! I’d like to add one other point: Data on the wire does not need to *look* like RDF in order to be *interpretable* as RDF. Any data format that has a standard mapping to RDF can be interpreted as RDF, and that is the key thing that is needed. That’s why the Yosemite Manifesto item #2 says: “Electronic healthcare information should be exchanged in a format that either: (a) is an RDF format directly; or (b) has a standard mapping to RDF.” It is also why step 2 of the Yosemite Project roadmap for healthcare information interoperability (YosemiteProject.org) is about defining mappings between existing healthcare information formats and RDF. Such mappings allow the meaning of the instance data to be uniformly intepreted in RDF.

  3. mjosborne says:

    For an article explaining why health data modelling should be a combination of the intensional and extensional methodologies, you can’t beat this, from people who know it and have done it for many years:
    http://informatics.mayo.edu/CIMI/index.php/Clinical_Models_and_SNOMED_Kaiser_Perspective

  4. I want to start by thanking both Thomas and David for a very civil discussion, in the face of clear disagreement. This has made the whole exchange very educational, and I even see some synthesis that might come out of the discussion.

    I believe there are a lot of parallels between this discussion and similar issues in the finance industry. There was a related question in a FIBO (Financial Industry Business Ontology) panel last week – about syntax lock-in. For the same reasons outlined here, we don’t want to propose a single syntax. But when we want to talk about interoperability between formats, we’d like the syntax to be out of the way. How do we talk about an element in an XSD, or a field in JSON, or a column in a RDB table, or a cell in a spreadsheet for that matter? If we want to say something about how one of these things relates to another, we at least need a way to refer to them. Preferably in a uniform manner. That is the role of the URI – it provides a way to refer to entities across a distributed system.

    Interestingly, this part of RDF, which is arguably the most valuable, is also the part that has already been vetted, tested and utilized on a very large scale. The problem of managing global references is part of the very infrastructure of the Web, and it works. RDF simply borrows it.

    This discussion has come to the agreement that RDF doesn’t solve any of the semantic mapping problems, and this is correct. But you can’t even approach semantic mapping problems unless you can refer to the things you want to map. That is the problem that RDF does solve, by using the infrastructure of the largest existing distributed system (the Web).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s