Firstly, we need to remember what problem we are trying to solve here. It is: find a universally acceptable representation (formalism + technology) for health content modelling. We also have to remember that we are in the infancy of this whole area of computing, which we might call semantically-enabled multi-level modelling, i.e. a style of computing that a) uses a semantic underpinning in the form of ontologies and terminologies, and b) uses modelling layers above the software information model layer (a minimal sketch of this two-layer idea follows the two points below). Being in its infancy, we need two things:
- purpose-built formalism(s): it is a mistake to try to solve a problem in the first instance without developing dedicated formalisms and tools (after having determined, of course, that it really is a different problem from those already solved by existing formalisms etc). The reason is that solving a complex problem is necessarily bound up with growing an understanding of the problem. Dedicated formalisms and tools allow this to be done; indeed, they express exactly the current understanding of the problem. Attempting to understand the problem and its solution only by adapting non-purpose-built tools means a long battle with those tools. It is not just that you have to fight with the formalism to implement what you think you want – it is worse: it becomes very difficult to capture in any clean way your understanding of the problem, much less explain it to the rest of the team. In the end, years can be spent trying to morph tools and formalisms (and maybe even convince the relevant SDOs) to fit the task at hand.
- agility: during any phase in which science (or industrial research, if you like) is being done (as opposed to pure engineering, where the science is already known), agility and flexibility are needed. When a new insight is gained, one needs to be able to move fast, make changes, and continue on the basis of the upgraded formalism and tools. This is already hard enough with a dedicated formalism and tools; with general-purpose tools and formalisms over which no direct control can be exercised, it is extremely difficult indeed.
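To make "modelling layers above the software information model" concrete, here is a minimal, deliberately simplified sketch of the idea (plain Python with hypothetical names, not openEHR code): a generic information-model class built into software, with a separately authored, archetype-style constraint applied to its instances at runtime.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Element:
    """Information-model layer: a generic named node holding a value (hypothetical, not openEHR code)."""
    name: str
    value: object

@dataclass
class ElementConstraint:
    """Modelling layer above it: a separately authored, archetype-style constraint."""
    name: str
    allowed_type: type
    check_value: Optional[Callable[[object], bool]] = None

    def conforms(self, element: Element) -> bool:
        # An instance conforms if its name, type and value all satisfy the constraint.
        if element.name != self.name or not isinstance(element.value, self.allowed_type):
            return False
        return self.check_value(element.value) if self.check_value else True

# A clinically authored constraint, defined without touching the software layer:
body_temp = ElementConstraint("body_temperature", float, lambda v: 25.0 <= v <= 45.0)
print(body_temp.conforms(Element("body_temperature", 38.5)))  # True
print(body_temp.conforms(Element("body_temperature", 99.0)))  # False
```

The point of the layering is that the constraint layer can be authored and revised by domain experts without changing the compiled software layer underneath it.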
In the end it is the difference between having a clear statement and understanding of both problem and solution, and having only a rough approximation of them, arrived at with a great deal more effort. It is like the difference between developing a new kind of boat from bicycle parts and scrap steel, versus building exactly what you want from timber and fibreglass; once the prototype is done, it can be manufactured in steel. In other words, we can't say that UML/OCL/MOF etc might not one day grow to encompass the needed semantics – in my view that is a possibility (depending on where OMG takes UML/OCL). But for that to be achieved efficiently, we need a clear statement / model of what we actually want. Using a dedicated formalism means that you can freely develop the entirety of this statement relatively quickly, test it with implementations, and show it to e.g. OMG, the Eclipse/Ecore project or other relevant bodies. Trying to develop the statement in existing UML tools just means endless waiting for organisations to change and enhance formalisms and tools, a step at a time, or going down a custom-tool-hacking path yourself. And if you are going to do the latter, you might as well be released from the shackles of the existing formalisms.
A comparison: consider OWL as an example. Most of the intellectual progress on OWL has been made using the abstract form of the language, because that is the only way its developers could both reason mathematically about semantic nets and compute with them. And it turns out that even today, when we consider the problem substantially solved, or at least substantially progressed, no one is saying 'ok, we can drop OWL now, we'll just do this in UML'. You can't, because UML wasn't built to efficiently express semantic network graphs and reasoners, even though, if you tried really hard, it might be possible to force it to do so. At the concrete level, of course, we have the OWL RDF/XML serialisation to enable easy low-level computing and interoperability.
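For readers less familiar with OWL, the fragment below shows the same single axiom twice: once in the abstract (functional-style) syntax in which the reasoning and language work is done, and once in the RDF/XML serialisation used for low-level exchange. The class names and namespace are invented for illustration, and the strings are wrapped in Python only to keep all the examples here in one language.

```python
# One OWL axiom ("a systolic BP measurement is a kind of BP measurement"), stated in two forms.
# Names and namespace are illustrative only.

# Abstract / functional-style syntax: the form used to reason about the language itself.
FUNCTIONAL_SYNTAX = "SubClassOf( :SystolicBloodPressure :BloodPressureMeasurement )"

# RDF/XML serialisation of the same axiom: the form used for low-level interoperability
# (namespace prefix declarations omitted for brevity).
RDF_XML = """
<owl:Class rdf:about="http://example.org/onto#SystolicBloodPressure">
  <rdfs:subClassOf rdf:resource="http://example.org/onto#BloodPressureMeasurement"/>
</owl:Class>
"""
```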
The archetype formalism is in a similar situation. Replicating its capability with the UML of today is hard. To get a feel for how hard, consider that since ADL 1.4 (three versions down the track from 1.0) was put into wide use in 2006, around 25 issues have been discovered, described here. These are just the changes between ADL 1.4 and ADL 1.5, all driven by implementation evidence. Try to imagine how hard it would be to do this with UML/OCL tools and OMG standards. Now consider the speed at which archetype tools have been developed: there are parsers in at least 4 languages; there is a full-featured online model repository whose design is 80% driven by the requests of modellers; there are downstream artefact generators for XSDs and APIs that have been deployed in production contexts for over 2 years. Another crucial feature of archetypes (to my knowledge not supported by any of the other formalisms) is a completely model-based and portable querying language (known as AQL, specified here; see the sketch below for its general shape). This is now in production use, populating complex HIS screens. Although funding for all of this has in fact been very limited, none of it would have been remotely possible if tied down by the weight of other non-adapted formalisms and tools.
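To give a feel for what "model-based and portable" means in practice, here is the general shape of an AQL query (held in a Python string purely for consistency with the other examples). It selects by archetype path rather than by physical table or column, so the same query runs on any system sharing the archetypes. The archetype id follows the commonly cited blood pressure example; the at-codes and paths should be read as illustrative rather than checked against a live repository.

```python
# Illustrative AQL: find systolic readings over 140, addressed by archetype path.
AQL_HIGH_SYSTOLIC = """
SELECT o/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude AS systolic
FROM EHR e
CONTAINS COMPOSITION c
CONTAINS OBSERVATION o [openEHR-EHR-OBSERVATION.blood_pressure.v1]
WHERE o/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude > 140
"""
```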
What of other candidate work efforts? None of this is to say that anyone should stop working on the UML-based activities that Kevin, VA, LRA etc are respectively engaged in. On the contrary, these activities are being pursued in order to solve needs in their relevant contexts. In fact, I think that if each of the 'UML' groups could pursue their main work (solving specific implementation problems) and also have at their disposal a dedicated modelling facility and tools for the content modelling part, their work would be enhanced. It would require some bridging from the dedicated formalism, including some tool integration. But I would suggest that this approach will in fact accelerate progress in these development contexts, because it separates the scientific development pathway for modelling from their local engineering concerns (which of course may be national in scope). Trying to do both activities in a single development formalism and environment is in some sense possible, but will inevitably create ongoing confusion and tension between two sets of needs.
Now, the downside. Pursuing the above means building new tools, parsers etc. Some people think that this is tremendously difficult, but in fact it is not. It does need people with the knowledge to build quality parsers, that's for sure. But with the right skills, the tools are easy enough to achieve. This work has been going on for about 8 years now, hence the availability of parsers in 5 languages. And consider that doing any customisation of tools like EMF or EA is not for the faint-hearted (until recently, EMF/Ecore didn't even support container types in the Ecore meta-model – it required custom changes). Another 'downside' is having to write a new specification. But this is also an upside: such a specification stands as an absolutely clear, dedicated statement of syntax and semantics, unobstructed by any other concerns.
Plurality. Lastly, we need to consider that there are more clinical modelling alternatives than just the various UML/OCL ones (that is already a 'plural' situation!). There are various HL7/CDA environments (not all the same), Tolven's TRIM development environment, and undoubtedly numerous others not represented in the CIMI forum. Even if we choose one of these technologies, and adapt it from its original purpose to the job at hand here, we still need to create bridges to all the others. Creating such bridges, from a central choice that is necessarily already an approximation to each of the other concrete technology environments, is going to be hard, and I think largely unachievable due to competing demands on funding.
The semantic underpinning. There is also OWL, which I would put into a different category, because it solves some semantic problems that need to be solved anyway. I am not convinced that it is that useful as a primary concrete technology for large-scale health data processing, because it has only a very weak connection with information models / database schemas, and at the high-performance end of health computing, these matter. Instead, I think OWL will find a number of uses, including:
- as an underpinning ontology for archetypes (think something like OGMS). In CKM we already have a simple OWL ontology for this purpose;
- potentially as a semantic validation mechanism for archetypes (today we can do a lot of technical validation with compilers, but only humans can do proper semantic validation);
- for runtime inferencing during query resolution.
(Note that quite a lot of work has been done in openEHR on converting archetypes in and out of OWL, and we have worked directly with Alan Rector’s group in the past on the transformations, so the community has some experience here).
In summary. I strongly believe we need a clean, coherent formalism and technology stack at the centre, and well-defined and engineered bridges to other CIMI development environments. This will provide clarity and enable issues to do with the central formalism to be clearly distinguished from those within the various target environments: a proper separation of technical and organisational concerns.
Two final points.
- On the data types: we need a set of data types that genuinely supports clinical modelling needs. The openEHR data types come closest to this in our experience, and indeed have largely been moulded by the requirements of that modelling (a brief sketch of what such types look like follows these two points). However, they are sufficiently different from HL7 and 21090 that I don't expect them to be accepted there, and in any case, some simplifications and improvements could certainly be made in hindsight. Hence, our recommendation would be to take Grahame Grieve's RFH data types as the starting point and develop from there. Although 21090 is unfortunately compromised by its subtractive modelling approach, I would recommend the 21090 document as a requirements reference, as it covers numerous use cases.
- On the reference model (RM): our experience with the openEHR reference model is similar to the above; it was developed largely in response to clinical modelling, as well as other research experience. It rests on 20 years of general research in health information thinking, and more, in fact, if we go back to Weed's POMR. As I have said in the past, the way to understand the RM in this context is not as a single model (which engenders some kind of war between openEHR, 13606, CDA and whatever else), but as a set of semantic patterns underpinning the modelling stack. The openEHR RM is not perfect, and indeed openEHR 2.x is being specified right now. It will probably include: various simplifications designed to bring openEHR and 13606 into a single model within ISO in 2012; elements from CDISC as well as process modelling; and higher-level concepts commonly used in the Intermountain and other major environments. However, what is there has largely been shown to be extremely well-adapted to large numbers of archetypes. The most obvious example is the Observation type, with its underlying history-of-events and data/state/protocol patterns (sketched below). Doing the same in e.g. the HL7 RIM or CDA, at the same level of detail, is possible but really quite hard. So our recommendation here is to base the CIMI RM on the key patterns from the openEHR RM (i.e. not all of it), keeping in mind that it is open for change heading into openEHR 2.x, and that people and organisations here can be involved in its evolution.
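As a rough indication of what "data types that support clinical modelling" means, here is a minimal Python reduction of two of the openEHR types referred to above, DV_QUANTITY and DV_CODED_TEXT. The field names follow the openEHR specifications, but the code itself is illustrative only, not a normative rendering of openEHR, RFH or ISO 21090.

```python
from dataclasses import dataclass

@dataclass
class CodePhrase:
    terminology_id: str      # e.g. "SNOMED-CT" or "local"
    code_string: str

@dataclass
class DvCodedText:
    value: str               # human-readable rubric
    defining_code: CodePhrase

@dataclass
class DvQuantity:
    magnitude: float
    units: str               # UCUM units string, e.g. "mm[Hg]"
    precision: int = -1      # -1 = precision not constrained (openEHR convention)

systolic = DvQuantity(magnitude=120.0, units="mm[Hg]", precision=0)
position = DvCodedText(value="Sitting", defining_code=CodePhrase("local", "at1001"))
```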
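And for the RM patterns, a similarly reduced sketch of the Observation / history-of-events / data-state-protocol shape mentioned above. Class and attribute names mirror the openEHR RM, but this is an illustrative Python simplification, not the RM itself (which uses typed item structures rather than plain dictionaries).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class Event:
    time: datetime
    data: Dict[str, object]                     # what was measured at this point in time
    state: Optional[Dict[str, object]] = None   # subject state when it was measured

@dataclass
class History:
    origin: datetime                            # anchor time for the event series
    events: List[Event] = field(default_factory=list)

@dataclass
class Observation:
    name: str
    data: History                                  # time series of measured data
    protocol: Optional[Dict[str, object]] = None   # how the measurement was done

bp = Observation(
    name="blood pressure",
    data=History(
        origin=datetime(2011, 11, 1, 9, 0),
        events=[Event(time=datetime(2011, 11, 1, 9, 0),
                      data={"systolic": 120, "diastolic": 80},
                      state={"position": "sitting"})],
    ),
    protocol={"cuff size": "adult"},
)
```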
The next CIMI meeting is on 29 November, in London. Let’s see what happens.