Tagging Information – Riding the Media Bits

The form of communication enabled by the 26 letters of the Latin alphabet is very effective, but in general it leaves out a wealth of other information that is present in the original (multimedia) message. If the sequence of characters is the transcription of a TV interview with a politician, the text will miss the inflexion of his voice, his sad or happy or concerned or angry face and the body gestures that may actually carry more information than the words themselves (assuming that his words convey much information). To cope with this limitation, over the centuries people found it necessary to add a number of characters or combinations of characters, such as !, ?, …, !?, etc., to the 26 original letters to make the message easier to interpret, less ambiguous and more complete. Other conventions have also been used, such as writing words in capital letters, underlining them, striking them through or writing them in bold or italic. Particularly with the advent of the Internet, emoticons such as 🙂 have become quite popular.

One way to schematise the above is to separate the content of a message into two parts: what can be expressed with characters and the rest. This is of course a very “character-centric” view of the world, one that betrays the attempt to build the complexity of multimedia communication from the bottom up, starting with characters. This approach, espoused by computer scientists, draws its motivation from the fact that characters were integrated in computers a long time ago.

There is a philosophical basis to this. Saint John’s Gospel starts by affirming that “In the beginning was the Word”, where “Word” is logos in the Greek version of the Gospel, so it can be interpreted as saying that everything started from rationality. Maybe (not the Gospel, the interpretation), but this is not what we experience in our daily life. The rationalisation of the world that gives rise to our words is a constant effort designed to minimise the impoverishment, if not the distortion, of reality that our words represent.

The separation advocated by computer scientists may have grounds in the Latin alphabet, but it is largely lost when people communicate using Chinese characters, where the very way the writer of the message uses his brush adds more about his feelings, or in a message written in Japanese, where the very fact that certain Chinese characters have been used instead of hiragana or katakana (or vice versa) adds information.

Back to technology, markup is the name given by IT people to information that is “additional” to text. A human can use it to have a better clue as to the real meaning of the words pronounced by another human, a computer can use it to perform appropriate processing and a printer can use it to present some text in bold to catch the reader’s attention.

One of the reasons we have hundreds of printer drivers in our computers is that every printer uses special codes to make titles appear large, bold and centered, to make paragraphs of a certain width with a bullet and an indent, and so forth. The situation is not so different from the days when linotypes were in use. But that situation was understandable, if not commendable, because linotypes were closed machines with no need or intention to let them communicate with other machines. Today’s behaviour is nothing but the continuation of a practice that dates back centuries, to when markup codes were used in manuscripts to give instructions to typesetters. The markup codes were meaningful only in the industry in which they were used, and maybe even specific to one particular publisher.

In the early 1980s ISO’s TC 97 Data Processing started working on a markup language that eventually became the ISO 8879:1986 Standard Generalised Markup Language (SGML). An SGML document is composed of content (made up of characters) and markup (made up of markup characters). To distinguish between the two, SGML inserts delimiter characters to indicate markup information. Two commonly used delimiters are the opening (“<”) and closing (“>”) angle brackets. A tag is then expressed as <anything>, the <> characters being the delimiters and “anything” being the markup code. The software processing the document will then know that the characters between “<” and “>” should be read in TAG mode, while the others should be read in CON (i.e. content) mode.
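The switch between TAG and CON mode described above can be sketched in a few lines of Python. This is only an illustration of the delimiter idea, not an SGML parser: the function name `split_modes` is hypothetical, only the “<” and “>” delimiters are handled, and everything else SGML defines (end tags, attributes, entities, alternative delimiters) is ignored.

```python
from typing import List, Tuple


def split_modes(document: str) -> List[Tuple[str, str]]:
    """Split a document into ("CON", text) and ("TAG", code) pieces.

    Assumption: "<" always opens a tag and ">" always closes it.
    """
    pieces: List[Tuple[str, str]] = []
    mode = "CON"          # a document starts in content mode
    buffer: List[str] = []
    for ch in document:
        if mode == "CON" and ch == "<":
            # Opening delimiter: flush pending content, switch to TAG mode.
            if buffer:
                pieces.append(("CON", "".join(buffer)))
            buffer = []
            mode = "TAG"
        elif mode == "TAG" and ch == ">":
            # Closing delimiter: emit the markup code, switch back to CON mode.
            pieces.append(("TAG", "".join(buffer)))
            buffer = []
            mode = "CON"
        else:
            buffer.append(ch)
    if mode == "TAG":
        raise ValueError("document ends inside a tag")
    if buffer:
        pieces.append(("CON", "".join(buffer)))
    return pieces


pieces = split_modes("<H1>Title<P>Some text<BR>more text")
for mode, value in pieces:
    print(mode, repr(value))
```

Each piece is labelled with the mode in which its characters were read, so a later processing stage can treat markup codes and content differently.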

At the beginning, the group developing SGML thought of defining a set of universal tags. Under this idea, once, say, “P”, “BR” and “H1” had been standardised, <P> would always mean a new paragraph, <BR> would always mean a break and <H1> would always indicate a first-level heading. This is the usual dilemma confronting developers of IT standards: should they produce something that is of immediate use, solving at least the most basic communication problem, or something that just gives the general rules that everybody can then customise? In the IT world the answer is regularly the latter, because “if there is something that I can do immediately, why should I share it with my competitors and create a level playing field?” SGML was no exception, and it was decided that SGML would not contain a set of standardised codes, but just a language that could be used to create a Document Type Definition (DTD). The DTD would define precisely the tags that would be used in a specific document.
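The idea that tags are not universal but declared for each type of document can be illustrated with a deliberately reduced Python sketch. It is not DTD syntax and it captures only one aspect of a DTD, the list of permitted tag names; a real DTD also declares, among other things, how elements may nest and which attributes they carry. The tag set `MEMO_TAGS`, the function `undeclared_tags` and the choice to compare names in upper case are all assumptions made for the example.

```python
import re
from typing import Set

# Hypothetical "document type": the tag names a memo is allowed to use.
MEMO_TAGS = {"MEMO", "TO", "FROM", "BODY", "P"}


def undeclared_tags(document: str, allowed: Set[str]) -> Set[str]:
    """Return the tag names used in the document but not declared."""
    # Capture the name that follows "<" or "</".
    used = set(re.findall(r"</?([A-Za-z][A-Za-z0-9]*)", document))
    # Assumption for this sketch: tag names are compared in upper case.
    return {name for name in used if name.upper() not in allowed}


memo = "<MEMO><TO>Ann</TO><H1>Hello</H1><BODY><P>See you.</P></BODY></MEMO>"
print(undeclared_tags(memo, MEMO_TAGS))
```

Here <H1> is reported because the memo document type does not declare it, even though another document type might well do so.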

SGML has been, in a sense, a successful standard, but only in closed environments, e.g. major printing organisations. Such a complicated arrangement could never have worked in the mass market, where the companies that developed Ventura, Word, WordPerfect and WordStar, to name a few, battled for years, with Word eventually becoming the word processing solution in the desktop environment. It should not be a surprise that the original Word format (or, better, formats) was entirely proprietary. A few years ago Office Open XML, which covers the file formats of Microsoft Word and other applications of the Office suite, was standardised as ISO/IEC 29500 Office Open XML File Formats. It uses XML, a derivative of SGML.