No SQL databases, documents and data – some misunderstandings

A good friend pointed me to this post: why you should never use MongoDB. It’s a very interesting post, about how bad MongoDB turned out to be for dealing with social network data. It’s not that MongoDB is bad per se, just that you have to understand what it is, what it could be used for, and when it won’t work.

It reminded me of my first reaction when I read the MongoDB Architecture whitepaper. It talks about ‘documents’ but what MongoDB stores is not documents, it’s serialised graphs, i.e. object structures. A ‘document’ is something authored, and whose content is fixed, until the authors next update it. An object serialisation is completely up for grabs. You can serialise a small piece of an object graph, say the data representing an Actor in a system like IMDB, or you could serialise the Actor + all his/her movies, or … you see where it’s going. One problem with dumb serialisation is that if you don’t take account of the difference between ‘association’ relationships (e.g. Actor – Movie) and ‘part-of’ links (e.g. Actor – PersonName, assume Actor is some kind of Person, which has a PersonName) you can end up ‘serialising the world’.
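To make the distinction concrete, here is a minimal sketch in Python. The class names (Actor, PersonName, Movie), their fields and the function actor_to_blob are illustrative assumptions, not anything taken from MongoDB or IMDB. Part-of data is embedded in the serialised form, while associated objects are replaced by their IDs, so the serialiser never walks from an actor into a movie and back out to the rest of the cast.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PersonName:
    """Part-of: a name has no life outside the person who owns it."""
    given: str
    family: str


@dataclass
class Movie:
    """Independent entity, shared between many actors."""
    movie_id: str
    title: str
    cast: list = field(default_factory=list)    # back-links to Actor objects


@dataclass
class Actor:
    actor_id: str
    name: PersonName                             # part-of link
    movies: list = field(default_factory=list)   # association


def actor_to_blob(actor: Actor) -> str:
    """Embed part-of data; reference associated entities by ID only."""
    return json.dumps({
        "actor_id": actor.actor_id,
        "name": {"given": actor.name.given, "family": actor.name.family},
        "movie_ids": [m.movie_id for m in actor.movies],
    })


actor = Actor("a1", PersonName("Jane", "Doe"))
movie = Movie("m1", "Some Film", cast=[actor])
actor.movies.append(movie)    # actor -> movie -> cast -> actor is now a cycle

blob = actor_to_blob(actor)
```

A serialiser that blindly followed every link would go from the actor to the movie, then to its cast, then to their movies, and so on. Cutting the walk at association boundaries is what keeps the blob bounded.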

What MongoDB is really doing is storing ‘blobs’ (blob = binary large object; binary JSON in MongoDB’s case), which means you serialise some part of an object graph into a lump and store that. You need to choose judiciously, and not serialise into one tree an Actor and all her Actor mates (recursively) – otherwise you’ll get everyone who’s ever known Matthew McConaughey in bed together on the disk, and that might be ugly. And it will take a huge amount of storage. And won’t support referential integrity. And have terrible performance.

Blobs have nothing to do with documents, except arguably when documents happen to be the source data objects (but even then, they can be blobbed in different ways). With blobbing, you can change the granularity of blobs – e.g. do it with

whole of Actor (i.e. whole of Person),
just chunks of Person info, e.g. Address, PersonName etc, or
something else.
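The choice of grain can be sketched as follows. This is only an illustration under assumed names: the person structure, the 'id/part' key scheme and both functions are hypothetical.

```python
import json

person = {
    "id": "p1",
    "name": {"given": "Jane", "family": "Doe"},
    "address": {"city": "Lisbon", "country": "PT"},
}


def blob_whole(p: dict) -> dict:
    """Coarse grain: one blob per Person, keyed by the person's ID."""
    return {p["id"]: json.dumps(p, sort_keys=True)}


def blob_chunks(p: dict) -> dict:
    """Finer grain: one blob per part, keyed by a hypothetical 'id/part' key."""
    return {
        f"{p['id']}/{part}": json.dumps(value, sort_keys=True)
        for part, value in p.items()
        if isinstance(value, dict)
    }


whole = blob_whole(person)     # a single entry, "p1"
chunks = blob_chunks(person)   # entries "p1/name" and "p1/address"
```

The same source data yields one stored lump or several, which is the sense in which the grain of blobbing is a design decision rather than a property of the data.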

In a relational DB used in the classical way, you don’t blob anything, you have a schema that fully expresses all the data, including various kinds of relationships all over the place; in a way it’s equivalent to the finest possible grain of blobbing of terminal primitive nodes like Strings, Integers etc.

Does it matter that No-SQL people misuse the term ‘document’? Well I would say at the very least it’s very confusing. MongoDB has built what I would call an optimising indexed blob database. Their documentation and the general misuse of the term ‘document’ in the discourse on No-SQL DBs will mislead the unwary.

Is there anything wrong with MongoDB? Probably not, but I have not investigated. My impression is that implementers of certain solutions, such as the authors of the post quoted at the top, didn’t initially understand what a blob DB is, or how to use it. I don’t yet know the details of MongoDB’s support for relationships, or how good its querying is, but I would assume there is a correct way to use it for any kind of data.

The main rule for using blobs is: you can only create a blob from a data graph whose internal relationships either:

have composition (aka ‘cascade-delete’) semantics; OR
are already references, e.g. URIs or IDs pointing to external entities or similar; BUT NOT
direct associations to shared / independent objects.
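One way to picture the rule is as a check run before serialising. The sketch below assumes each field is annotated with the kind of relationship it represents. The rel helper, the three kind labels and the example classes are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field, fields, is_dataclass

COMPOSITION, REFERENCE, ASSOCIATION = "composition", "reference", "association"


def rel(kind, **kw):
    """Declare a field together with the kind of relationship it represents."""
    return field(metadata={"rel": kind}, **kw)


@dataclass
class PersonName:
    given: str
    family: str


@dataclass
class Movie:
    movie_id: str
    title: str


@dataclass
class Actor:
    name: PersonName = rel(COMPOSITION)                    # owned part
    agent_uri: str = rel(REFERENCE)                        # already a reference
    movies: list = rel(ASSOCIATION, default_factory=list)  # shared objects


def blob_violations(obj, path=""):
    """Return the paths of fields holding direct associations."""
    bad = []
    for f in fields(obj):
        kind = f.metadata.get("rel")
        value = getattr(obj, f.name)
        here = f"{path}/{f.name}"
        if kind == ASSOCIATION:
            bad.append(here)
        elif kind == COMPOSITION and is_dataclass(value):
            # Owned parts must obey the rule too, so check them recursively.
            bad.extend(blob_violations(value, here))
    return bad


actor = Actor(
    PersonName("Jane", "Doe"),
    "https://example.org/agents/7",
    [Movie("m1", "Some Film")],
)
violations = blob_violations(actor)   # the 'movies' field breaks the rule
```

An empty result would mean the graph is safe to serialise as one blob. Here the movies field would first have to be turned into a list of IDs, or moved out of the blob altogether.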

Then you have to manage the other relationships in a different way, quite probably using normal relational tables. Lastly, you need smart indexes that understand paths-in-blobs and a query engine that also understands paths. You might also go to the extent of dynamic adaptive blobbing and re-indexing, depending on frequency of access etc. This is very pertinent in health data, where a great deal of older data is not that useful, but specific items are essential. With the right heuristics, a DB based on blobbing could dynamically self-optimise to an extent. Some old ideas on this here.
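As a rough illustration of what an index that understands paths-in-blobs might do, the sketch below flattens each blob into (path, value) pairs and maps each pair to the blobs containing it. The path syntax, the PathIndex class and the sample health data are assumptions made for the example. A real engine would also need range queries, typed values and index maintenance on update.

```python
from collections import defaultdict


def iter_paths(node, prefix=""):
    """Yield (path, value) for every primitive leaf in a nested dict/list blob."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from iter_paths(child, prefix)   # list positions are ignored
    else:
        yield prefix, node


class PathIndex:
    """In-memory index from (path, value) to the IDs of blobs containing it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, blob_id, blob):
        for path, value in iter_paths(blob):
            self._index[(path, value)].add(blob_id)

    def find(self, path, value):
        return set(self._index.get((path, value), set()))


index = PathIndex()
index.add("b1", {"patient": "p1", "allergies": [{"substance": "penicillin"}]})
index.add("b2", {"patient": "p2", "allergies": [{"substance": "latex"}]})

hits = index.find("/allergies/substance", "penicillin")
```

The query is answered from the index alone, without opening or deserialising any blob, which is the point of indexing paths rather than whole blobs.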

The more interesting point is probably that you can use a normal relational DB to implement an indexed blob-store. This is what many implementers of openEHR, ISO 13606, HL7 CDA and other health data standards do, and is common in many industries. MongoDB and other no SQL DBs may well have lessons to teach us on doing this better.
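A minimal sketch of the idea, using Python's built-in sqlite3 module: the blob is stored as JSON text, a side table indexes its paths, and associations between independent entities live in an ordinary relational table. The table names, column names and sample data are assumptions for illustration, not the schema of any particular openEHR or CDA implementation.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blob (id TEXT PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE path_index (
        blob_id TEXT NOT NULL REFERENCES blob(id),
        path    TEXT NOT NULL,
        value   TEXT NOT NULL
    );
    CREATE INDEX path_value ON path_index (path, value);
    -- associations between independent entities stay in a normal table
    CREATE TABLE acted_in (actor_id TEXT NOT NULL, movie_id TEXT NOT NULL);
""")


def leaf_paths(node, prefix=""):
    """Yield (path, value) for each primitive leaf of a nested dict/list."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, node


def store(blob_id, doc):
    """Write the blob and its index rows in a single transaction."""
    with conn:   # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO blob VALUES (?, ?)", (blob_id, json.dumps(doc)))
        conn.executemany(
            "INSERT INTO path_index VALUES (?, ?, ?)",
            [(blob_id, p, str(v)) for p, v in leaf_paths(doc)],
        )


store("a1", {"name": {"given": "Jane", "family": "Doe"}})
store("m1", {"title": "Some Film"})
with conn:
    conn.execute("INSERT INTO acted_in VALUES (?, ?)", ("a1", "m1"))

# Path lookup inside blobs, joined to the association table.
rows = conn.execute("""
    SELECT m.body
    FROM path_index AS i
    JOIN acted_in AS r ON r.actor_id = i.blob_id
    JOIN blob AS m ON m.id = r.movie_id
    WHERE i.path = '/name/family' AND i.value = 'Doe'
""").fetchall()
titles = [json.loads(body)["title"] for (body,) in rows]
```

Because the blob, its index rows and the association rows sit in the same database, they can be written under ordinary relational transactions.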

I would be interested to see more experience reports of MongoDB and other no SQL DBs used with complex data containing both part-of and association relationships.

About wolandscat

I work on semantic architectures for interoperability of information systems. Much of my time is spent studying biomedical knowledge using methods from philosophy, particularly ontology and epistemology.


4 Responses to No SQL databases, documents and data – some misunderstandings

Koray Atalag says:

15/11/2014 at 06:40

I’d be keen to evaluate storing openEHR data with Intersystems’ Cache?

- Ian McNicoll says:
  
  13/01/2015 at 14:54
  
  There has been at least one openEHR solution developed with Cache. I understand that it worked pretty well.
  
  - Birger says:
    
    03/02/2015 at 17:54
    
    Hi Ian, can you provide some more details on that implementation? There might be one master thesis exploring object databases (on the example of cache) for storing openEHR. Maybe some prior experience might help to get a better start with this.
Seref says:

16/02/2016 at 19:50

Well, a late comment (quite) but still worth mentioning a few things. First of all, Mongo is atomic at the document level (using mongo terminology, replace document with blob), and documents have a max limit of 16 megabytes. Which means, if you can’t find a way of fitting all data that you want to insert/update in a transactional context in 16 megabytes, you lose the guarantee provided by relational dbs that either everything will be written or nothing. Mongo people will go to great lengths to convince people that there are workarounds for this, but I for one am not buying that argument. If you want to manage associations using different blobs, say goodbye to transactional operations (hey, who cares if you fail to write the last document/blob that says the patient is allergic to something, right?) The web generation loves Mongo for the things it does well, and it does a lot of things well, but as an EHR implementer I’m just watching the industry in shock 🙂