XML presents a very versatile way to store, edit and process semistructured documents. But its reputation has suffered because it has been used for many things it was never intended for. This article explores some valid and invalid applications of XML and tries to distil a set of principles that help decide whether XML is the right choice for a problem.
Introduction
Because XML is such a versatile way to store and retrieve information, it has been used for almost every kind of application: from configuration files to databases, and even as the syntax of a programming language (XSLT).
After 20 years and a slew of uses, it is clear that XML is not going away soon, even though its demise has been foretold and valid criticism expressed. However, a variety of alternative formats and approaches have made it increasingly clear where XML is not the right choice.
This article discusses how XML, or rather the formats derived from it, is positioned among these alternatives. It also suggests some criteria for choosing one technology over another.
Quick aside on the religious wars: it's clear that everything can be done with everything. That has always been the case in programming. But luckily we don't have to use assembler for everything, or JavaScript for that matter.
Relational data
Relational data is very well understood, and relational databases are mature and stable. The relational model interconnects atomic bits of data such as strings, integers or booleans in tables to build a queryable structure. Queries make it possible to assemble the data in varying combinations, often called views. The same piece of data can be part of any number of different views.
This is where relational data differs from the hierarchical data typically represented by XML documents. In a hierarchy, the meaning of each piece of data is very much tied to its position. Breaking up the hierarchy and recombining the elements makes the data incomprehensible.
This leads us to a first criterion for the suitability of XML: are the elements of the data structure recombinable, or is there only one true view?
The fact that many relational databases include an XML data type is somewhat telling: XML documents are usually treated as atomic values that are not meant to be split up.
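To make the idea of recombinable data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names and rows are invented purely for illustration; the point is only that the same atomic rows can be assembled into more than one view.

```python
import sqlite3

# Hypothetical schema and rows, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES author(id)
    );
    INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO book VALUES
        (1, 'Notes', 1),
        (2, 'Compilers', 2),
        (3, 'Sketches', 1);
""")

# View 1: every title listed next to its author.
books_by_author = conn.execute("""
    SELECT author.name, book.title
    FROM book JOIN author ON author.id = book.author_id
    ORDER BY author.name, book.title
""").fetchall()

# View 2: the same rows recombined into a count of books per author.
books_per_author = conn.execute("""
    SELECT author.name, COUNT(*)
    FROM book JOIN author ON author.id = book.author_id
    GROUP BY author.name
    ORDER BY author.name
""").fetchall()

print(books_by_author)
print(books_per_author)
```

Neither query is the "true" shape of the data. A paragraph of a document, by contrast, rarely makes sense once it is pulled out of its chapter and joined to something else.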
Data versus presentation
Many of the successful applications of XML carry one common trait: They are presentation oriented. Let’s see some examples:
- SVG
- HTML
- DocBook
- OpenDocument
- MathML
- GML
A lot of damage has been done by presenting XML as a means to completely separate content from presentation. That is precisely not one of the areas where it shines.
On the other hand, many of these formats have also been successful because they are not tied too closely to an exact final look (otherwise we could use the same language for all these applications). Rather, they describe the author's intent and how the author sees the connections between the different elements.
This is what people usually mean when they talk about semantic markup.
Tags were even added to the PDF standard (with PDF 1.4, in 2001) to support different ways of presenting a document, at the time mostly for screen readers.
Tagged files versus hierarchically structured data
The one-true-view criterion also holds in some areas where XML has been replaced by other markup languages or data stores, mostly for the better. The most prominent replacements are:
- JSON
- YAML
- No-SQL databases
All of them, like XML, keep their data in a tree-like structure. Where they differ from XML is that they usually impose much more structure: every piece of data has to be tagged in some way. It's a top-down approach, whereas the markup approach for documents is mostly bottom-up.
The difference is especially visible in so-called mixed content:
XML:
<para>This is an example of
<strong>mixed content</strong>
in <abbrev>XML</abbrev></para>
JSON:
{
  "para": [
    "This is an example of ",
    {"strong": "mixed content"},
    " in ",
    {"abbrev": "JSON"}
  ]
}
Obviously, XML integrates text and tagged bits of text far more naturally than other hierarchical data stores.
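As a rough illustration of what mixed content looks like to a program, the following sketch parses the paragraph from the XML example with Python's standard xml.etree.ElementTree module. The variable names and the list-based encoding at the end are assumptions made for this example, not a standard mapping.

```python
import xml.etree.ElementTree as ET

xml_source = (
    "<para>This is an example of "
    "<strong>mixed content</strong>"
    " in <abbrev>XML</abbrev></para>"
)
para = ET.fromstring(xml_source)

# ElementTree stores the text before the first child in .text and the
# text that follows each child element in that child's .tail.
pieces = [("text", para.text)]
for child in para:
    pieces.append((child.tag, child.text))
    if child.tail:
        pieces.append(("text", child.tail))

# One possible ordered, list-based encoding of the same content
# (an illustrative choice, not a standard one).
as_list = [text if kind == "text" else {kind: text} for kind, text in pieces]

# The running text is still recoverable in document order.
plain_text = "".join(para.itertext())

print(as_list)
print(plain_text)
```

In XML the running text and the tagged spans sit side by side in reading order. In a strictly key-value structure, that order has to be encoded explicitly, for instance as a list.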
To be clear, we are only discussing syntax here. And as mentioned in the introduction, everything can be done with everything. This leads us to some more philosophical musings.
Philosophical aside
If it's just a difference of syntax, then why should we even bother? This is a pretty fundamental question in computer science. Why would we need more than one programming language? All of them are Turing-complete anyway, so they can basically all do the same things. The eternal answer to this question is: because of humans.
A programming language is nothing but an interface between developer and machine. A generalized interface is usually not very efficient. That's why each domain has its own interfaces. For example, it's more efficient to input music with a piano keyboard compared to clicking with a mouse on the screen. In return, the point-and-click interface is applicable to a much broader set of problems.
So all we need to argue is that XML is more efficient, and closer to the way humans think about documents, than other methods.
The payoff is clear: the smaller the difference between the mental model and the input or the display, the fewer "translation errors" will be produced.
Is XML really a better UI for document editing?
The only evidence that XML has a more natural syntax for marking up typical documents is that it is still widely used for this purpose. This might be due to real superiority, or just because it has become the go-to solution. Even though my intuition tells me that it is mainly the former, I have no proof of this.
The only way to really know would be thorough usability testing with people who are not used to any of the syntaxes in question.
Tools
A last and very pragmatic criterion for choosing or not choosing an XML-based format is the tools that are available.
In the XML world there are three main tools to consider, each of which can help shorten development time and reduce errors:
Schemas
The availability of a validation framework for semi-structured data is pretty unique to the XML world. Schemas make it possible to define a required minimum of structure while retaining flexibility where possible.
XPath
Finding a set of elements in a document without having to think about loops and recursion? Pretty feasible with XPath. It's basically CSS selectors on steroids. And everyone loves CSS selectors.
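A small sketch of what this looks like in practice, using Python's standard xml.etree.ElementTree module. Note that ElementTree only implements a limited subset of XPath, and the sample document and its ids are made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up document with sections nested at different depths.
doc = ET.fromstring("""
<article>
  <section id="intro">
    <para>Plain <strong>first</strong> paragraph.</para>
  </section>
  <section id="tools">
    <para>About <abbrev>XPath</abbrev>.</para>
    <section id="nested">
      <para>A <strong>deeply nested</strong> remark.</para>
    </section>
  </section>
</article>
""")

# Every <strong> element at any depth, in document order,
# without writing the tree traversal by hand.
strong_texts = [el.text for el in doc.findall(".//strong")]

# Only the <abbrev> elements inside paragraphs that are direct
# children of the section with id "tools".
tools_abbrevs = [
    el.text for el in doc.findall(".//section[@id='tools']/para/abbrev")
]

print(strong_texts)
print(tools_abbrevs)
```

The path expressions say what to find and leave the traversal to the library. A full XPath engine offers considerably more (axes, functions, richer predicates) than the subset used here.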
XSLT
A language to transform an XML document into any other output format.
Not necessarily for everyone. But for very complex transformations, it has proven to produce results that are pretty comprehensible. When used in a functional style, it is also pretty safe to make changes to transformations that one doesn't completely understand, without running the risk of breaking everything in the process.
Conclusion
An XML-based format is clearly the right choice if:
- There is one true view and elements are not to be split up and reordered
- The data implies a specific presentation like a map or a document
- The data is loosely structured as a tree (semistructured)
- The data needs to be directly manipulated by humans