On Additive and Subtractive Formalisms

In IT, there are numerous formalisms or languages which enable the definition of entities of some kind in terms of more primitive elements. Formalisms used for ‘modelling’ (rather than, say, query definition) tend to fall into a small number of categories, which function either ‘additively’ or ‘subtractively’. Mixing these two modes of definition is almost always a recipe for problems.

Object-Oriented Languages

The type of formalism in use in mainstream software development today is the object-oriented (OO) language, with or without functional facilities (curried functions, lambdas etc). Programming languages such as Java, C#, Python, Ruby, TypeScript, C++, PHP as well as the UML and openEHR BMM fit this description. While supporting encapsulation of data and behaviour, like other module-based formalisms, the defining characteristic of an OO language is inheritance, which is a facility enabling the progressively specialised definition of classes down lineages, by inheriting from more basic ancestors. Genericity (template classes) may also be supported. Classes provide the definition of types, which are in turn templates for data instances.

Inheritance, and therefore OO, are additive paradigms in the sense that any class definition adds to and/or overrides elements of its inheritance ancestor(s). Any specialised class thus contains differential elements with respect to its ancestors. The effective definition of the type for a class is arrived at by flattening the definition elements (data, methods, constants etc) down an inheritance lineage leading to the class.
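
The flattening step can be sketched in a few lines of Python. This is only an illustration under assumed names: the classes Animal, FlyingAnimal and Bird, the features dictionaries and the helper flattened_features are hypothetical and not taken from any real model. Each class declares only its differential elements, and the effective definition is computed by merging them down the lineage.

```python
class Animal:
    # Base class: deliberately few, very general features (hypothetical).
    features = {"name": str, "weight_kg": float}


class FlyingAnimal(Animal):
    # Differential definition: only what is new relative to Animal.
    features = {"wingspan_m": float}


class Bird(FlyingAnimal):
    # Again, only the addition with respect to the parent.
    features = {"egg_size_mm": float}


def flattened_features(cls):
    """Merge differential feature sets from the most general ancestor down to cls."""
    effective = {}
    for ancestor in reversed(cls.__mro__):
        # vars() sees only what this class declares itself, not inherited values.
        effective.update(vars(ancestor).get("features", {}))
    return effective


bird_definition = flattened_features(Bird)
```

Here bird_definition contains all four features, while flattened_features(Animal) contains only the two general ones: moving down the lineage can only add to, or override, what the ancestors define.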

Inheritance also leads to polymorphism, which is the ability for dynamic attachment of instances to references of more general statically defined types, e.g. instances of Circle or Square to attach to an attribute shape of type Shape, where Circle and Square are classes inheriting from Shape.
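
As a rough Python illustration of the Shape example, the following sketch attaches Circle and Square instances to an attribute declared as Shape. The container class Drawing and the area method are assumptions added for the example only.

```python
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self) -> float:
        ...


class Circle(Shape):
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return math.pi * self.radius ** 2


class Square(Shape):
    def __init__(self, side: float):
        self.side = side

    def area(self) -> float:
        return self.side ** 2


class Drawing:
    # Hypothetical owner of the 'shape' attribute.
    def __init__(self, shape: Shape):
        # Declared as Shape; any descendant may be attached at runtime.
        self.shape: Shape = shape

    def describe(self) -> str:
        # No type test needed: the call dispatches on the attached instance.
        return f"{type(self.shape).__name__} with area {self.shape.area():.2f}"


descriptions = [Drawing(Square(2.0)).describe(), Drawing(Circle(1.0)).describe()]
# descriptions == ["Square with area 4.00", "Circle with area 3.14"]
```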

Some practical consequences of object-oriented modelling formalisms include:

  • ‘base’ classes, i.e. top-level and near top-level classes, are very general, and should contain few features (attributes and methods);
    • a well-known anti-pattern is the ‘god’ class, filled with attributes relating to more specialised types, e.g. a class Animal that has features like Wingspan, TuskLength and EggSize which should only belong to classes like FlyingAnimal etc. This blog post explains consequences for e-health standards.
  • where an attribute or function could return objects of multiple types at runtime, in the design-time model it must be specified as being of an abstract parent type of the intended concrete types allowed at runtime.
    • a well-known anti-pattern is for pseudo-parent types to be created that do not define any coherent common semantics, to enable runtime substitution of objects of arbitrary types.

Ontology Languages

Languages for authoring ontologies are conceptually relatively simple, while being rigorous. The conceptual basis is Aristotelian, and consists of entities (aka types, or universals) being described in terms of a more general parent (i.e. inheritance) plus differentiae (often formulated as ‘a B is an A that c’, e.g. a mammal is a vertebrate that nourishes young with milk). Formal ontology languages such as OWL are thus additive formalisms.

Constraint Formalisms

Constraint formalisms are those that define statements or structures that apply to artefacts expressed in modelling formalisms, usually to reduce the instance space according to specific semantic or domain rules. The OMG’s Object Constraint Language (OCL) is one relatively well-known such formalism, and allows class invariants, and routine pre- and post-conditions, to be applied to UML model elements. W3C’s XQuery is another constraint formalism, which works by progressively applying constraints to a data set (XML content) in order to generate a final result set matching specific criteria.

openEHR’s ADL (original version 2002; adopted as ISO 13606-2 in 2008, 2019) is another constraint formalism that includes the equivalent of invariants, as well as structural and value-based constraining. It operates on UML class models, although concrete implementations today all use BMM-expressed models. ADL-expressed archetypes can be understood as something like classes or types at the domain level, in the sense that one object model class, say Observation, can be constrained into archetypes for hundreds of specific kinds of observations such as vital signs, eye exam, lab tests and so on.

Constraint formalisms can be understood as reductive or subtractive, since they generate refined variants of a basic model concept by adding constraints which reduce the set of data instances that will match a definition. For example, only a few Observation instances in a database will conform to an ADL blood-sugar Observation archetype.
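
The reductive effect can be made concrete with a small Python sketch. It is not ADL: the Observation fields, the sample data and the list blood_sugar_constraints are all invented for illustration, and real archetypes constrain much richer structures.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # Stand-in for a model class; the fields are illustrative only.
    concept: str
    value: float
    units: str


# Hypothetical constraint set: each entry narrows what an Observation may contain.
blood_sugar_constraints = [
    lambda o: o.concept == "blood_glucose",
    lambda o: o.units == "mmol/L",
    lambda o: 0.0 <= o.value <= 100.0,
]


def conforming(instances, constraints):
    """Return the instances that satisfy every constraint."""
    return [o for o in instances if all(check(o) for check in constraints)]


database = [
    Observation("blood_glucose", 5.4, "mmol/L"),
    Observation("heart_rate", 72.0, "/min"),
    Observation("blood_glucose", 98.0, "mg/dL"),
    Observation("body_temperature", 36.8, "Cel"),
]

# Size of the matching set as constraints are applied one after another.
match_counts = [
    len(conforming(database, blood_sugar_constraints[:n]))
    for n in range(len(blood_sugar_constraints) + 1)
]
# match_counts == [4, 2, 1, 1]
```

Each added constraint leaves the matching set the same size or smaller; no constraint can make an instance appear that the underlying class did not already allow.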

Relational Databases and Queries

Relational databases consist of tables, which are multiple rows of a typed tuple (the set of column name:type definitions). A row in a table contains data roughly equivalent to an instance of a class in an object model. Fields may be values or relations, specified as primary/foreign key pairs. The main characteristic we are interested in here is how querying works. The essential structure of an SQL query is given by the standard syntax:

SELECT cols FROM some_table WHERE value_constraints

The first two parts generate a view, which is a projection defined by the SELECTed columns (cols) from the total available columns, i.e. the original table definition. Imagine there is a table with 26 columns, each named with a letter of the alphabet, from ‘a’ to ‘z’ – this is the ‘some_table’ argument in the SELECT / FROM / WHERE statement above. The SELECT part is a particular subset of columns, say a, c, h, i, j. The WHERE part determines which rows will be included in the result, but the important part is that the SELECT projection defines a view on the original table. This is the primary way new instance definitions (which we might think of as new ‘types’ in OO theory) are derived from existing ones (tables, or previously generated views). This is a subtractive paradigm – each new view is a reduced version of its predecessor.
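
The 26-column example can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: the integer column type, the sample values and the WHERE condition are assumptions chosen to keep it short.

```python
import sqlite3
import string

columns = list(string.ascii_lowercase)  # 'a' .. 'z'

conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f"{name} INTEGER" for name in columns)
conn.execute(f"CREATE TABLE some_table ({column_defs})")

# Three sample rows; the values are arbitrary.
placeholders = ", ".join("?" for _ in columns)
for row_number in range(3):
    values = [row_number * 100 + offset for offset in range(len(columns))]
    conn.execute(f"INSERT INTO some_table VALUES ({placeholders})", values)

# SELECT projects 5 of the 26 columns; WHERE then filters the rows.
cursor = conn.execute(
    "SELECT a, c, h, i, j FROM some_table WHERE a >= 100 ORDER BY a"
)
view_columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
# view_columns == ['a', 'c', 'h', 'i', 'j']
# rows == [(100, 102, 107, 108, 109), (200, 202, 207, 208, 209)]
```

The tuple shape of the result is a strict subset of the table's columns, which is the sense in which the derived ‘type’ is a reduction of its source.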

A SQL query as a way of generating a new ‘type’ is thus a very different paradigm from object-orientation, which is additive, as described above. Confusion between relational thinking and object-orientation was one of the main reasons for the problems in the HL7v3 RIM, and in the approach to generating dependent models, i.e. RMIMs and CMETs. Specifically, the RIM was modelled in UML and presented as an object-oriented model, but most of the RIM classes were ‘god classes’, and were treated like relational tables, as if they were a basis for generating both additive classes (RIM sub-classes) and subtractive views (the RMIM message definitions).

Mixed Formalisms

Some formalisms contain a mixture of additive and subtractive semantics, usually a recipe for problems. Probably the best known is W3C XML-schema, which supports two kinds of inheritance, ‘restriction’ (subtractive logic) and ‘extension’ (additive logic). A series of specialised schemas may use both kinds, resulting in schemas that are hard to analyse in modelling environments as well as for run-time use.

Making things more difficult is the fact that the rules for inheritance of tag attributes are different from those for elements (sub-objects). For these reasons, XML-schema is not considered an OO formalism, nor generally used as a primary modelling formalism in the IT industry, but rather generated from other models as a way of concretely defining XML document contents.

XML schema also supports the notion of arbitrary type choice, which bypasses any concept of inheritance-based typing, by allowing the type of an element to be one of an arbitrary set. This kind of modelling results in significant extra complexity in software that deals with XML documents, in the form of if / then / elseif logic chains with minimal logic re-use via polymorphic invocation.
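
The difference in consuming code can be sketched in Python. All names here (PhoneNumber, EmailAddress, Contact and so on) are hypothetical; the first half mimics an element whose type is an arbitrary choice, the second half models the same alternatives under a common parent type.

```python
from dataclasses import dataclass


# Case 1: 'choice' between unrelated types with no common parent.
@dataclass
class PhoneNumber:
    digits: str


@dataclass
class EmailAddress:
    address: str


def render_contact(item) -> str:
    # Every consumer repeats this chain, and must extend it when the choice grows.
    if isinstance(item, PhoneNumber):
        return "tel:" + item.digits
    elif isinstance(item, EmailAddress):
        return "mailto:" + item.address
    else:
        raise TypeError(f"unhandled choice member: {type(item).__name__}")


# Case 2: the same alternatives defined as descendants of one parent type.
class Contact:
    def uri(self) -> str:
        raise NotImplementedError


@dataclass
class Phone(Contact):
    digits: str

    def uri(self) -> str:
        return "tel:" + self.digits


@dataclass
class Email(Contact):
    address: str

    def uri(self) -> str:
        return "mailto:" + self.address


choice_style = [render_contact(PhoneNumber("5551234")), render_contact(EmailAddress("a@example.org"))]
polymorphic_style = [c.uri() for c in (Phone("5551234"), Email("a@example.org"))]
```

Both lists hold the same strings, but in the second style the logic lives with each type and the calling code contains no type tests.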

This 2010 paper by Suad Alagic covers the many difficulties of mapping XSD to OO, and has this to say about ‘choice’:

XSD choice represents a major problem for OO interfaces to XML. Specifying a fixed number of subtypes of a type is contrary to the core features of the OO model. Because of the lack of a suitable representation for choice, some OO interfaces use the same representation for choice and sequence groups. This representation has nontrivial implications because these two types of groups have different semantics. In fact, widely known OO interfaces to XML do not have a suitable representation of XSD groups and its three subtypes (i.e., sequence, choice, and all groups). There are many more problems in mapping XSD schemas to OO schemas [7].
Mapping XSD to OO schemas. Available from: https://www.researchgate.net/publication/226053508_Mapping_XSD_to_OO_schemas [accessed May 08 2019].

What’s Wrong with Mixing Additive and Subtractive Logic?

The essential problem with mixed formalisms that are not cleanly based on additive or subtractive logic is that they conflate two quite different concepts: type-based definition and untyped, ad hoc projection. Subtractive (i.e. constraint) logic is what we find in the SELECT clause of a query, and the ad hoc type choices of certain ADL archetypes – it doesn’t define anything, and may violate the type system of the underlying model. A definition, such as a FHIR resource, that uses both types of logic thus tries to function as a type definition and a query, without being either.

Formalism Layering

Formalisms that are additive or subtractive along the lineage of definition specialisation (the inheritance lineage in the OO case) both have their uses, including potentially within the same model ecosystem. To be successful, the first rule is to separate their artefacts into different modelling layers, so that representation and processing of any one dimension of a model uses only one kind of logic. The second rule is to use the additive formalism / modelling first (in the lower layer(s)), and then the constraint formalism(s). Formalisms that enable the definition of entities with mixed additive and subtractive logic at the finest level will run into trouble in implementation.

In the UML framework, UML (additive OO logic) and its constraint counterpart OCL are clearly separated within UML models: the UML part can easily be processed on its own, with OCL statements being processed separately with respect to the UML structures they annotate, or even ignored (as happens in most UML tools).

In the openEHR model framework, information models are defined in UML and also represented for machine processing in BMM. A separate layer of models called archetypes is defined in the Archetype Definition Language (ADL), which is a constraint-logic based formalism. An archetype is a pure set of constraints applied to constructive type definitions from the underlying information model. This enables coherent tools to be written for each layer, and for the logic that applies to the different levels to be easy to understand by modellers concerned with each one.
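
The layering idea can be illustrated with a simplified Python sketch. It is neither ADL nor BMM: the classes Entry and Observation, the Archetype class and its matches method are hypothetical stand-ins. The point is only that the constraint layer refers to the model layer, may narrow it, and is rejected if it tries to introduce something the model does not define.

```python
from dataclasses import dataclass, fields


# Layer 1 (additive): the information model, built by inheritance.
@dataclass
class Entry:
    subject: str


@dataclass
class Observation(Entry):
    concept: str
    value: float


# Layer 2 (subtractive): constraints that refer to the model but never extend it.
class Archetype:
    def __init__(self, model_class, constraints):
        known = {f.name for f in fields(model_class)}
        unknown = set(constraints) - known
        if unknown:
            # A constraint layer cannot introduce attributes the model lacks.
            raise ValueError(f"not in {model_class.__name__}: {sorted(unknown)}")
        self.model_class = model_class
        self.constraints = constraints

    def matches(self, instance) -> bool:
        return isinstance(instance, self.model_class) and all(
            check(getattr(instance, name)) for name, check in self.constraints.items()
        )


heart_rate = Archetype(
    Observation,
    {"concept": lambda c: c == "heart_rate", "value": lambda v: 0 <= v <= 300},
)
ok = heart_rate.matches(Observation("patient-1", "heart_rate", 72.0))
not_ok = heart_rate.matches(Observation("patient-1", "heart_rate", 900.0))
```

Here ok is True and not_ok is False, and an attempt such as Archetype(Observation, {"wingspan": lambda w: w > 0}) raises ValueError, since wingspan is not part of the underlying model. Each layer can be processed with one kind of logic only.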