Document-Orientation

One of about 300 papers at http://avancier.website.

Lightly edited from the original paper by Phil Bryan; last updated 02/04/2017 23:18

This work is licensed under the Creative Commons Attribution 4.0 International License.

To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

“to no one will we sell, to no one deny or delay right or justice” …. Magna Carta, clause 40.

This document uses the label Document Store for what some call document databases or document-oriented databases.

Contents

Preface

The ever-increasing importance of documents in human society

Traditional (usually relational) databases

Document Stores

Trends and implications for enterprise and solution architects

Preface

Written documents have played a large part in the history of mankind.

They have enabled writers to command, instruct, persuade and inform readers; and entertain them on occasions.

The mountainous accumulation of documents in the developed world attests to their huge importance.

They are so commonplace and ubiquitous they are taken for granted.

But will documents continue to be relevant in the information age?

Will they be superseded by continually changing online content such as Wikis, Twitter and 24hr news feeds?

This paper considers the enduring appeal and influence of the humble document.

How documents were challenged by the rise of online databases.

Recent shifts in document formats and database technologies.

And the convergence of document stores and other kinds of database.

The ever-increasing importance of documents in human society

A document is “an exposition of text, extending to a finite length, expressed in a human-readable language and structure.”

How did documents come to be so successful?

Human speech is transient, easily misheard and misremembered.

Documents provide persistent records that can be studied over and over, and shared.

They are written and read using languages already known to writers and readers.

To read a document seems effortless; practically no physical resource or energy is expended.

Documents carry poetry and prose, directions, descriptions and messages.

They provide us with compelling, contemporaneous evidence of historical entities and events.

(The 11th-century Domesday Book has been called a “triumph of the written record”.)

They have enabled the formalisation of social systems into business systems.

They declare treaties and laws; they record births, marriages and deaths.

They are used in the conduct of academic, educational, medical, military and other business affairs.

Document technology has evolved, making it easier to produce documents.

Scribes carved the first writing on wet clay tablets, perhaps 5,000 BC.

About 3,000 BC, the Egyptians invented papyrus, and by about 500 BC most people in West Asia and the Mediterranean used it.

Surviving clay tablets and papyrus documents tell us of those people’s beliefs, stories and ways of life.

Parchment and vellum (animal skins) were the main document technology for about 2,000 years.

By the 15th century, they were replaced by paper - easier to manufacture but not so persistent.

Handwriting was gradually supplanted by typing and printing.

Nowadays, the keyboards and visual display units of computers are increasingly pervasive.

The success of documents depends also on their portability.

Initially, messengers carried single documents from senders to receivers.

Mechanised postal systems grew into enormous enterprises that deliver letters to all parts of the globe.

Digital documents can now be stored and exchanged electronically - requiring no transportation of physical materials.

This has enabled documents to be broadcast across large distances well-nigh instantaneously.

The importance of the document context to understanding

Natural language is inherently fluid and ambiguous.

A document makes sense to its readers only in what is called a “domain of knowledge” or “bounded context”.

To extract the writer’s meaning, the reader must understand that context, and the language used in it.

Documents are born in a context, in a set of circumstances; they pertain to particular events or facts in particular situations.

The author(s), publication date(s), format and target audience are typically disclosed inside a document or upon its cover.

The context appears also in the language, style and content of the document, and in references to other documents or information sources.

A “bounded context” is also fluid.

The terms and concepts used within a domain of knowledge change and grow over time.

You can append text to a document – add extra information that illuminates, explains, supplements and even revises its meaning.

(Moreover, the meaning intended by a writer is open to reinterpretation by readers.

Perhaps, if that were not so, our collective knowledge could not advance.)

Traditional (usually relational) databases

Documents have been widely used and perfected over millennia.

Databases are a very recent innovation, and not so human-friendly.

Databases are designed to hold precisely-defined structured facts about things that a business needs to remember in order to complete business processes.

Database Management Systems manage the storage and retrieval of those facts as data in databases.

The efficient storage and retrieval of data comes at a cost that is now being questioned.

To understand this cost, it is necessary to re-examine the database design process.

Databases require the text in documents to be dismembered and structured into tables.

A table contains rows that each represent a thing of interest, an entity (e.g. customer) or event (e.g. order) in the business domain.

Analysis and design typically proceed in steps, along these lines.

Document definition (messages, displays, forms and reports)

Define documents needed/used to perform activities in roles or processes.

Define documents created in the course of performing activities in roles or processes.

Define the documents’ data items in a data dictionary or canonical data model.

Database design

Analyse I/O documents to find “entities” identified by primary keys.

Define a logical data model - by “normalising” and relating the entities.

Define additional audit trail data to be stored (data entry time, place etc.)

Code the data model as a database schema using the chosen Database Management System.

Refine the database design to ensure processes are performant.

Design to ensure the database satisfies confidentiality, integrity and availability (CIA) and scalability requirements.
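
As a rough illustration of the database design steps above, the following Python sketch dismembers one order document into normalised tables, using the standard-library sqlite3 module. The table names, column names and sample order are assumptions made for illustration; they are not taken from any particular system.

```python
import sqlite3

# Hypothetical input document: one order, as it might arrive on a form or message.
order_document = {
    "order_no": 1001,
    "customer": {"customer_no": 7, "name": "A. Buyer"},
    "lines": [
        {"product": "widget", "quantity": 3},
        {"product": "gadget", "quantity": 1},
    ],
}

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_no INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_no INTEGER PRIMARY KEY,
        customer_no INTEGER NOT NULL REFERENCES customer(customer_no),
        entered_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- audit trail data
    );
    CREATE TABLE order_line (
        order_no INTEGER NOT NULL REFERENCES customer_order(order_no),
        line_no INTEGER NOT NULL,
        product TEXT NOT NULL,
        quantity INTEGER NOT NULL,
        PRIMARY KEY (order_no, line_no)
    );
    """
)


def store_order(db, doc):
    """Split one order document into rows in three related tables."""
    cust = doc["customer"]
    db.execute("INSERT OR IGNORE INTO customer VALUES (?, ?)",
               (cust["customer_no"], cust["name"]))
    db.execute("INSERT INTO customer_order (order_no, customer_no) VALUES (?, ?)",
               (doc["order_no"], cust["customer_no"]))
    for line_no, line in enumerate(doc["lines"], start=1):
        db.execute("INSERT INTO order_line VALUES (?, ?, ?, ?)",
                   (doc["order_no"], line_no, line["product"], line["quantity"]))
    db.commit()


store_order(conn, order_document)
line_count = conn.execute("SELECT COUNT(*) FROM order_line").fetchone()[0]
```

Note that the original document no longer exists as a unit after storage; it has to be reassembled by joining the three tables.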

It is difficult to develop a database schema containing scores or hundreds of discrete entities.

Sometimes the attempt to create a single coherent, consistent schema is overambitious for the need, or at least, for the immediate need.

Worse, once the database is populated with data, it is even more difficult to change the database schema, since a data migration exercise is needed.

A whole section of the IT industry has grown to support the development and management of databases, at great cost to the businesses that rely upon them.

However, over the last decade, several alternatives to the traditional database have been proposed, including Document Stores (also known as document databases or document oriented databases).

Aside on Object-Relational Mapping (ORM)

ORM software products (such as ActiveX Data Objects and Java Data Objects) address the inflexibility of a traditional persistent database structure.

They make data from a traditional database available to programmers in the transient memory of an “application server”.

These transient data structures may resemble document structures and can be easily changed between program executions.

However, ORM software cannot decrease the cost or increase the flexibility of an underlying data storage structure.

Nor is it an alternative to Document Stores.

Document Stores

Document Stores are designed to hold complete human-readable documents and messages.

The documents are stored and indexed without being “normalised” or dismembered into tables.

(They may however undergo algorithmic, lossless data compression.)

The database has no schema!

Instead, the vocabulary and grammatical structure of each document is declared (often by reference to a schema document) in the document itself.

The vocabulary (typically English) is human-readable, and has not been altered for storage-efficiency.

However, the grammatical structure is limited to what a machine-readable data format can express: sequences, selections and iterations of named elements (much as in a regular expression), which may be nested.

That is the kind of grammar you can define using a data format such as XML or JSON.

So, the documents and messages are defined using self-describing data structure formats such as:

· XML - a data format used in OASIS message schemas and in OAGIS Business Object Documents (BODs)

· SAP Intermediate Documents (IDocs).
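
To make “self-describing” concrete, here is a minimal Python sketch that reads a small XML order message and recovers, from the document itself, the vocabulary (namespace) and schema it claims to follow. The namespace URI, schema file name and element names are invented for illustration; they do not come from OASIS, OAGIS or SAP. The standard-library parser used here reads the declaration but does not validate the document against the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical order message; the namespace URI and element names are invented.
ORDER_XML = """\
<Order xmlns="http://example.com/schemas/order/v1"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://example.com/schemas/order/v1 order-v1.xsd">
  <OrderNumber>1001</OrderNumber>
  <Customer>A. Buyer</Customer>
  <Line><Product>widget</Product><Quantity>3</Quantity></Line>
  <Line><Product>gadget</Product><Quantity>1</Quantity></Line>
</Order>
"""

XSI = "http://www.w3.org/2001/XMLSchema-instance"

root = ET.fromstring(ORDER_XML)

# The document itself says which vocabulary (namespace) and schema it follows.
# ElementTree writes qualified names as "{namespace}localname".
namespace = root.tag[1:].split("}")[0]
schema_location = root.get(f"{{{XSI}}}schemaLocation")

# The repeated Line element is an iteration within the document's structure.
ns = {"o": namespace}
products = [p.text for p in root.findall("o:Line/o:Product", ns)]
```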

Aside: Don’t confuse OASIS and OAGIS.

OASIS is a standards body that publishes standard schemas for XML messages.

OAGIS is a specific standard for business-to-business integration from a different standards body, the OAG.

OAGIS BODs and SAP IDocs are different from OASIS “for information” messages.

It isn’t just that a document is a collection of related content that is stored for future reference.

The crucial point is that documents are created and used in the natural course of business processes.

The source, timing and sequence of document creation and receipt may be highly significant to their processing and need to be recorded.

The Document Store design process

The storage structure of a Document Store (like the denormalised structure of a data warehouse) is less economical than that of a relational database.

But it is unhelpful to think of a Document Store as a denormalised database, because the development lifecycles are different.

This table compares the two processes.

Database design	Document Store design
(Both) Investigate any controlled vocabulary or conceptual model used in the business context.
(Both) Document definition (messages, displays, forms and reports): define documents needed/used to perform activities in roles or processes; define documents created in the course of performing activities in roles or processes; define the documents’ data items in a data dictionary or canonical data model.
Analyse I/O documents to find “entities” identified by primary keys.	(no corresponding step)
Define a logical data model - by “normalising” and relating the entities.	Define each input document as a logical structure (a regular expression)
Define additional audit trail data to be stored (data entry time, place etc.)	Define additional context data to be stored (document provenance information etc.)
Code the data model as a database schema using the chosen Database Management System.	Code the input data structures using the chosen data format standard (XML, JSON, whatever)
Refine the database design to ensure processes are performant.	Refine the database design to fulfil output requirements, while keeping input documents intact.
(Both) Design to ensure the database satisfies CIA and scalability requirements.

Document history/audit trail

Document Stores can efficiently store the context data (aka metadata or document schema) that describes a document.

This greatly simplifies records of a document’s provenance and history.

The audit trail of document creation and update is held at the document level (rather than the entity level).

This makes it easier to understand the history of a document and who has updated it.

The history is visible to and understandable by the end user (not requiring analysis of an entity-level audit trail).
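
A document-level audit trail can be sketched in a few lines of Python. The envelope layout below (a "body" plus "context" data holding source and history) and the function names are assumptions made for illustration, not the format of any particular product.

```python
import copy
from datetime import datetime, timezone


def create(document, author, source):
    """Wrap a document in an envelope that carries its provenance and history."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "body": copy.deepcopy(document),
        "context": {
            "source": source,
            "history": [{"action": "created", "by": author, "at": now}],
        },
    }


def update(record, changes, author):
    """Apply changes to the body and log the update at the document level."""
    record["body"].update(changes)
    record["context"]["history"].append({
        "action": "updated",
        "by": author,
        "at": datetime.now(timezone.utc).isoformat(),
        "fields": sorted(changes),
    })
    return record


record = create({"order_no": 1001, "status": "received"},
                author="clerk1", source="web form")
update(record, {"status": "shipped"}, author="clerk2")

# The whole history of the document can be read straight from the document.
editors = [entry["by"] for entry in record["context"]["history"]]
```

Because the history travels with the document, showing an end user who changed what needs no joins across entity-level audit tables.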

Data variety

Document Stores can hold a wide variety of data structures in one data store, and apply the power of computing to that data.

Varieties include:

· Largely unstructured narrative documents, in which large chunks of text appear as single elements (e.g. research papers or news media).

· Documents containing finer-grained data elements (e.g. email with elements for To, From, Copy, Date, Subject etc.).

· Less-well described data conveyed using a mark-up language (which may not convey the meaning of tagged terms).

· OAGIS BODs and SAP IDocs.

· Structured order, invoice, payment messages defined using a standard like the OASIS Universal Business Language (an XML-based counterpart to EDIFACT).

Again, OAGIS BODs and SAP IDocs are different from OASIS messages.

Input to output transformation

Document Stores typically store input documents in the structure received, as it was originally expressed, using the original vocabulary and grammar.

By keeping all parts of the document intact, the context and history of the input document are preserved.

They can however store documents in the structure required for output (the better to serve queries, searches or reports).

Document Stores are typically used where the input documents need not be transformed for output.

Or else, where the transformation of input documents to output documents is straightforward.

Where the transformation of input documents to output documents is complex, or ad hoc outputs are required, a more traditional database may be a better solution.
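
A straightforward input-to-output transformation might look like the following Python sketch. The stored order, its field names and the summary format are hypothetical. The point is only that the output is derived on demand while the input document is left intact.

```python
import copy

# Hypothetical stored input document, kept exactly as received.
stored_order = {
    "order_no": 1001,
    "customer": {"name": "A. Buyer"},
    "lines": [
        {"product": "widget", "quantity": 3, "unit_price_pence": 250},
        {"product": "gadget", "quantity": 1, "unit_price_pence": 1000},
    ],
}
snapshot = copy.deepcopy(stored_order)  # kept only to show the input is unchanged


def to_invoice_summary(order):
    """Project a stored input document onto a simpler output document."""
    return {
        "order_no": order["order_no"],
        "customer": order["customer"]["name"],
        "total_pence": sum(line["quantity"] * line["unit_price_pence"]
                           for line in order["lines"]),
    }


summary = to_invoice_summary(stored_order)
input_unchanged = stored_order == snapshot
```

An output that had to combine many documents of different kinds would need far more code than this, which is where a more traditional database may serve better.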

Flexibility to accommodate change

Elements can be added to or removed from a general document structure to create a revised document structure, without changing older document instances. (Configuration management of document structure revisions unsurprisingly adds complexity to the administration of Document Stores.)

Queries can read through a file of documents that were stored before and after such a change.

Of course, there are limits to what kinds of change can be made without impacting attempts to analyse and report on stored data.

Changing the name or meaning of an element (say email address to contact address, or order number to stock number) will likely cause headaches.

But generally speaking, Document Stores accommodate business and technical change more easily than traditional databases.
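
The following Python sketch shows a query reading through documents stored before and after a structure revision. The documents, the "revision" element and the field names are invented for illustration; revision 2 adds a phone element and makes email optional.

```python
# Hypothetical documents stored under two revisions of one document structure.
documents = [
    {"revision": 1, "order_no": 1001, "email": "a@example.com"},
    {"revision": 2, "order_no": 1002, "email": "b@example.com", "phone": "555-0100"},
    {"revision": 2, "order_no": 1003, "phone": "555-0101"},
]


def contact_points(doc):
    """Read optional elements defensively, so old and new revisions both work."""
    return [doc[key] for key in ("email", "phone") if key in doc]


# One query runs over all documents; the older instance was never rewritten.
contacts = {doc["order_no"]: contact_points(doc) for doc in documents}
with_phone = [doc["order_no"] for doc in documents if "phone" in doc]
```

Renaming an element would not be absorbed so easily: every query like contact_points would need to know both the old and the new name.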

Data manipulation operations

Create, update and delete operations apply to whole documents.

Operations are often accomplished using a query language that is analogous to SQL.

However, it should be said that query languages (such as XQuery) are still maturing.
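
Whole-document operations can be illustrated with a tiny in-memory stand-in, written in Python. The class and method names below are hypothetical and do not mirror the API of MongoDB, MarkLogic or any other product. A plain Python predicate stands in for a query language.

```python
import copy
import itertools


class TinyDocumentStore:
    """In-memory stand-in for a Document Store; operations apply to whole documents."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def create(self, document):
        doc_id = next(self._ids)
        self._docs[doc_id] = copy.deepcopy(document)
        return doc_id

    def read(self, doc_id):
        return copy.deepcopy(self._docs[doc_id])

    def replace(self, doc_id, document):
        if doc_id not in self._docs:
            raise KeyError(doc_id)
        self._docs[doc_id] = copy.deepcopy(document)

    def delete(self, doc_id):
        del self._docs[doc_id]

    def find(self, predicate):
        """Query: return the ids of all documents for which predicate(doc) is true."""
        return [doc_id for doc_id, doc in self._docs.items() if predicate(doc)]


store = TinyDocumentStore()
first = store.create({"type": "order", "status": "received"})
second = store.create({"type": "email", "subject": "Hello"})

# An update reads the whole document, changes it, and writes the whole document back.
order = store.read(first)
order["status"] = "shipped"
store.replace(first, order)
store.delete(second)

shipped = store.find(lambda doc: doc.get("status") == "shipped")
```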

Data integrity

Relational constraints between documents may be enforced by the database, if necessary, using relational or graph “multi-model” database technologies.

Document Stores, being schema-less, can adopt more flexible approaches to the management of constraint breaches e.g. warning and information messages issued in place of fatal errors.
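
A flexible approach to constraint breaches might be sketched as follows in Python. The three rules, the severity labels and the function names are assumptions chosen for illustration; only a breach graded "error" stops a document from being stored.

```python
def check_order(doc, known_customers):
    """Return (severity, message) findings instead of failing on the first breach."""
    findings = []
    if "order_no" not in doc:
        findings.append(("error", "order_no is missing"))
    if doc.get("customer_no") not in known_customers:
        findings.append(("warning", "customer_no does not match a known customer"))
    if not doc.get("lines"):
        findings.append(("info", "order has no lines yet"))
    return findings


def accept(doc, known_customers):
    """Decide whether to store the document; warnings and info do not block it."""
    findings = check_order(doc, known_customers)
    stored = all(severity != "error" for severity, _ in findings)
    return stored, findings


# An unknown customer and an empty order are reported, but the document is stored.
stored_a, findings_a = accept({"order_no": 1, "customer_no": 99, "lines": []},
                              known_customers={7})

# A missing order number is graded as an error, so this document is rejected.
stored_b, findings_b = accept({"customer_no": 7, "lines": [{"product": "widget"}]},
                              known_customers={7})
```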

Data migration

Document Stores, being schema-less, can reduce or eliminate the need for data migration.

Data can be manipulated and transformed in-situ by means of stepwise modification of data integrity constraints.

That is, constraints that raise fatal errors are relaxed to allow the application of scripts, then subsequently re-applied.
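
The stepwise approach can be sketched in Python as below. The constraint (every line quantity must be an integer), the sample documents and the "strict" flag are hypothetical. A real store would apply its own constraint mechanism.

```python
def validate(doc, strict):
    """Count lines with a non-integer quantity; raise only in strict mode."""
    problems = [line for line in doc["lines"]
                if not isinstance(line["quantity"], int)]
    if problems and strict:
        raise ValueError(f"order {doc['order_no']}: non-integer quantity")
    return len(problems)


# Hypothetical stored documents; the first holds its quantity as legacy text.
store = [
    {"order_no": 1, "lines": [{"quantity": "3"}]},
    {"order_no": 2, "lines": [{"quantity": 5}]},
]

# Step 1: relax the constraint, so that legacy documents are tolerated and counted.
breaches_before = sum(validate(doc, strict=False) for doc in store)

# Step 2: transform the documents in place with a script.
for doc in store:
    for line in doc["lines"]:
        line["quantity"] = int(line["quantity"])

# Step 3: re-apply the constraint as a fatal error; every document now passes.
for doc in store:
    validate(doc, strict=True)
breaches_after = sum(validate(doc, strict=False) for doc in store)
```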

Business intelligence

It is claimed that operational Document Stores need not be supplemented by data warehouses.

Business object documents remain as they were created, reflective of business operations as they happened.

So, they do not need to be reassembled in a separate database to provide business intelligence.

More advantages of Document Stores

They can include reference to the real-world event that initiated the need for the document to be created e.g. receipt, adjustment.

They allow synchronisation between databases at the document level (rather than data element level) eliminating issues of inconsistent data representation.

They allow database transaction mechanisms such as multi-version concurrency control (MVCC) to operate more simply at the document level.

Trends and implications for enterprise and solution architects

Document Stores are not a panacea for every data storage problem.

They may not be the best choice where there are complex input-to-output transformations and/or data integrity rules, or where ad hoc queries are needed.

Nevertheless, they are an option to be considered by solution architects, data architects and enterprise architects.

Particularly where the primary requirement is to store, display and exchange documents (in the widest meaning of that term).

Document Stores are beginning to compete with more traditional Database Management Systems.

Both in terms of features such as query languages and ACID transactions, and in terms of price-performance.

Numerous Document Store technologies are in use, and many are available in the cost-efficient form of “platform as a service”.

Besides open-source MongoDB, there are now proprietary offerings such as DocumentDB from Microsoft (Azure) and MarkLogic from MarkLogic Corp.

Heterogeneous solution architectures

The days of homogeneous enterprise systems supported by traditional relational databases are numbered.

The drift in Enterprise Architecture is towards a heterogeneous technology environment.

Architects must learn to bring multiple solution components together into a solution architecture, in which Document Stores may form a key part.

Multi-model databases

There is a trend for traditional Database Management Systems to include Document Store features alongside relational and graph database features.

If such “multi-model” databases can deliver on their promises, their future adoption may be driven by organisations seeking lower total cost of ownership and simplified development processes.