
Tagging Information

The form of communication enabled by the 26 letters of the Latin alphabet is very effective, but in general it leaves out a wealth of other information that is present in the original (multimedia) message. If the sequence of characters is the transcription of a TV interview with a politician, the text will miss the inflexion of his voice, his sad or happy or concerned or angry face, and the body gestures that may actually carry more information than the words themselves (assuming that his words convey much information). To cope with this limitation, over the centuries people found it necessary to add a number of characters or combinations of characters, such as !, ?, …, !?, etc. to the 26 original letters to make the message easier to interpret, less ambiguous and more complete. Other conventions have also been used, such as writing words in capital letters, underlining them, striking them through or writing them in bold or italic. Particularly with the advent of the Internet, emoticons such as 🙂 have become quite popular.

One way to schematise the above is to separate the content of a message into two parts: what can be expressed with characters, and the rest. This is of course a very “character-centric” view of the world, one that betrays the attempt to build the complexity of multimedia communication from the bottom up, starting with characters. This approach, espoused by computer scientists, draws its motivation from the fact that characters were integrated into computers a long time ago.

There is a philosophical basis to this. Saint John’s Gospel starts by affirming that “In the beginning was the Word”, where word is logos in the Greek version of the Gospel, so it can be interpreted that everything started from rationality. Maybe (not the Gospel, the interpretation), but this is not what we experience in our daily life. The rationalisation of the world that gives rise to our words is a constant effort designed to minimise the impoverishment, if not the distortion, of reality that our words represent.

The separation advocated by computer scientists may have grounds in the Latin alphabet, but it is largely lost when people communicate using Chinese characters, where the very way the writer of the message uses his brush adds more about his feelings, or in a message written in Japanese, where the very fact that certain Chinese characters have been used instead of hiragana or katakana (or vice versa) adds information.

Back to technology, markup is the name given by IT people to information that is “additional” to text. A human can use it to have a better clue as to the real meaning of the words pronounced by another human, a computer can use it to perform appropriate processing and a printer can use it to present some text in bold to catch the reader’s attention. 

One of the reasons we have hundreds of printer drivers in our computers is that every printer uses special codes to make titles appear large, bold and centred, to make paragraphs of a certain width with a bullet and an indent, and so forth. The situation is not so different from the days when linotypes were in use. But that situation was understandable, if not commendable, because linotypes were closed machines with no need or intention to communicate with other machines. Today’s behaviour is nothing but the continuation of a practice that dates back centuries, to when markup codes were used in manuscripts to give instructions to typesetters. The markup codes were meaningful only in the industry in which they were used, maybe even specific to one particular publisher.

In the early 1980s ISO’s TC 97 Data Processing started working on a markup language that eventually became ISO 8879:1986 Standard Generalized Markup Language (SGML). An SGML document is composed of content – made up of characters – and markup – also made up of characters. To distinguish between the two, SGML uses delimiter characters to signal markup information. Two commonly used characters are the opening (“<”) and closing (“>”) angle brackets. A tag is then expressed as <anything>, the “<” and “>” characters being the delimiters and “anything” being the markup code. The software processing the document will then know that the characters between “<” and “>” should be read in TAG mode, while the others should be read in CON (i.e. content) mode.

At the beginning, the group developing SGML thought of defining a set of universal tags. With this idea, once, say, “P”, “BR” and “H1” had been standardised, <P> would always mean a new paragraph, <BR> would always mean a line break and <H1> would always indicate a first-level heading. This is the usual dilemma confronting developers of IT standards: something that is of immediate use, solving at least the most basic communication problem, or something that just gives the general rules that everybody can then customise? In the IT world the answer is regularly the latter, because “if there is something that I can do immediately, why should I share it with my competitors and create a level playing field?”. SGML was no exception, and it was decided that SGML would not contain a set of standardised codes, but just a language that could be used to create a Document Type Definition (DTD). The DTD would define precisely the tags that would be used in a specific document.
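
To make the idea more concrete, a fragment of a DTD declaring tags of this kind might look like the sketch below (the element names and content models are purely illustrative and not taken from any actual DTD):

<!ELEMENT DOC (H1, P+)>        <!-- a document is a heading followed by one or more paragraphs -->
<!ELEMENT H1  (#PCDATA)>       <!-- a heading contains plain character data -->
<!ELEMENT P   (#PCDATA | BR)*> <!-- a paragraph contains character data, possibly with line breaks -->
<!ELEMENT BR  EMPTY>           <!-- a line break has no content -->

A document declared against this DTD could then use only these tags, combined as the content models prescribe.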

SGML has been, in a sense, a successful standard, but only in closed environments, e.g. major printing organisations. There was no way such a complicated arrangement could work in the mass market, where the companies that developed Ventura, Word, WordPerfect and WordStar, to name a few, battled for years, with Word eventually becoming the word processing solution in the desktop environment. It should not be a surprise that the original Word format (or, better, formats) was entirely proprietary. A few years ago Office Open XML, which covers the formats of Microsoft Word and the other applications of the Office suite, was standardised as ISO/IEC 29500 Office Open XML File Formats. This uses XML, a derivation of SGML.


The World Wide Web

The World Wide Web (WWW) is a global network of interconnected information elements represented in a standard form using a dialect of SGML called HyperText Markup Language (HTML), using the HyperText Transfer Protocol (HTTP) for the transport of information, employing a uniform addressing scheme for locating resources on the Web called Uniform Resource Locator (URL), a set of protocols for accessing named resources over the Web, and a growing number of servers that respond to requests from browsers (or clients) for documents stored on those Web servers. In a simple, but not entirely correct, comparison, HTML is the information representation (corresponding to, say, MPEG-2 Video) and HTTP is the information transport (corresponding to, say, MPEG-2 Transport Stream). Of course MPEG-2 does not have the notion of “hyperlink”.

A summary sequence of the events that led to the creation of the WWW can be described by the following table:

Year Name Description
1989 Proposal Tim Berners-Lee, a physicist working at the Conseil Européen pour la Recherche Nucléaire (CERN) in Geneva, writes a document entitled “Information Management: A Proposal”
1990 WWW The proposal is approved and Tim starts working on a program that would allow linking and browsing text documents. Program and project are called WorldWideWeb
1991 Browser A line-mode browser is developed and WWW is released on CERN machines
1992 HTML, HTTP There are 26 reasonably reliable servers containing hyperlinked documents. The major technical enablers of the work are HTML and HTTP
1993 Mosaic Marc Andreessen and other employees of the National Center for Supercomputing Applications (NCSA) in the USA create the first user-friendly, “clickable” Internet browser. The browser is given away free to increase the publicity of Mosaic
1994 Netscape Marc leaves NCSA to start his own company, first called Mosaic Communications Corp. and later Netscape Communications Corporation. The success of the Netscape browser is immediate
1994 W3C The immediate success of the WWW triggers the establishment of the World Wide Web Consortium at MIT. CERN discontinues support of these non-core activities and transfers them to the Institut National de Recherche en Informatique et en Automatique (INRIA) in France. A W3C centre is later hosted at Keio University in Japan. In recent years W3C has become a major Standards Developing Organisation producing Information Technology standards.
1995 Traffic In March, WWW traffic surpasses FTP traffic
1997 Servers The threshold of 1 million WWW servers is crossed

When Tim Berners-Lee needed a simple way to format web pages, he drew inspiration from SGML, but that standard did not suit his needs. So he developed the HyperText Markup Language (HTML), a simplified form of SGML that uses a pre-defined standard set of tags. This shows that when IT people address a mass market, they know very well what they must do. I bet that if he had introduced SGML without any standard set of tags, we would not have the billions of web pages that we have in today’s world.

The basic structure of an HTML document is

<HTML>
<HEAD> Something here </HEAD>
<BODY> Something else here </BODY>
</HTML> 

From this example one can see that an HTML document is contained between the pair <HTML> and </HTML> and that it consists of two main parts: the Head and the Body, contained between the pair <HEAD> and </HEAD> and the pair <BODY> and </BODY>, respectively. The Head contains information about the document. The element that must always be present in the Head is <TITLE>, whose content appears as a ‘label’ on the browser window. A tag that may appear in the Head is <META>, which can be used to provide information for search engines. The Body contains the content of the document with its tags.
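
For instance, a Head carrying a title and a short description for search engines might look like the following (the META content is, of course, just an example):

<HTML>
<HEAD>
<TITLE>Address list</TITLE>
<META name="description" content="A short list of employee contacts">
</HEAD>
<BODY> Something else here </BODY>
</HTML>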

Imagine now that I want to create a document that contains the centred and bold title “Address list” and two short bulleted lists, like this:

Address list 

Employee ID: 0001 

  • Leonardo Chiariglione 
  • leonardo@cedeo.net
  • 0001

Employee ID: 0002

  • Anna Bugnone 
  • anna@cedeo.net
  • 0002

In HTML this can be represented as

<HTML>
<HEAD>
<TITLE>Address list</TITLE>
</HEAD>
<BODY>
<CENTER><B>Address list</B></CENTER>
Employee ID: 0001
<UL>
<LI>Leonardo Chiariglione</LI>
<LI>leonardo@cedeo.net</LI>
<LI>0001</LI>
</UL>
<P>
Employee ID: 0002
<UL>
<LI>Anna Bugnone</LI>
<LI>anna@cedeo.net</LI>
<LI>0002</LI>
</UL>
</BODY>
</HTML>

In this HTML document the pair <CENTER> and </CENTER> indicates that “Address list” should be displayed as centred, the pair <B> and </B> indicates that the character string “Address list” should be displayed as bold, the pair <UL> and </UL> indicates that a bulleted list is included and the pair <LI> and </LI> indicates an item in the list is included between the pair. <P> is an instruction to the interpreter to create a new paragraph. 

In the first phases of the web’s evolution, the IETF managed the development of the HTML “communication standard”, but only until HTML 2.0, published as RFC 1866. In the meantime, different business players were waging wars among themselves to control the browser market, each trying to define its own HTML “format” by adding proprietary tags to the language that would be understood only by its own browser, and not by its competitors’. W3C took over the standardisation of HTML, of which there is now version 4.0, with about 90 different tags, and HTML5, of which we will say more later.

Search engines

The large number of servers (already 1 million in 1997, and a hundred times more since then) containing millions of pages (and now many billions) prompted the development of search engines. In essence a search engine performs 3 functions: crawling, i.e. browsing the WWW to find web pages; indexing, i.e. storing suitably indexed information found during crawling in a database; and searching, i.e. responding to a user query with a list of candidate links to web pages extracted from the database on the basis of some internal logic.

There were several attempts, even before the appearance of the WWW, to create search engines. One of the early successful attempts was Digital Equipment’s AltaVista, launched in 1995 at altavista.digital.com, around the time Larry Page and Sergey Brin were starting the research that led to Google. The AltaVista search service was an immediate success, but the Google search engine, launched in 1998 and based on more powerful research, soon overwhelmed AltaVista (and other search engines as well). AltaVista later became available at www.altavista.com, but today that URL leads to the Yahoo search page. Today the search business is highly concentrated in the hands of Google (~2 out of 3 queries), Bing (~1 out of 5 queries) and Yahoo! (~1 out of 10 queries). The search engine business model is invariably based on advertising, in the sense that if I type “new houses in Tuscany” it is likely that the search engine will post an ad from a real estate developer next to the search results (probably some general articles related to Tuscany), because I am likely interested in a new house in Tuscany.

If the search engine treats all my queries as independent, it will do a poor job at guessing what I am actually looking for, and also at displaying relevant ads. So the search engine collects a history of my searches, creates a model of me on its servers, uses “cookies” in my browser to know that the query is coming from me, and displays more effective responses and ads. All good then? Not really, because there are two main concerns. The first is that someone is collecting such detailed information about billions of people, and the second is that the logic by which search results are displayed is unknown. This is creating apprehension in some – especially European – countries that feel uneasy about so much information being in the hands of such huge global companies, which have become virtual monopolies in such important businesses as answering people’s queries and collecting a growing percentage of the money spent on advertising. As a consumer I notice that a search engine is just drawing information from me for free and reselling it to advertisers. Even more seriously, I fear that the growing exposure of people to the information – voluntarily – accessed from search engines can shape what people think in the long run.

It is a very complex problem, but the root of it is that search engine users do not know anything about the logic that drives the presentation of information to them. This is understandable, because that algorithm is what makes a search engine better than the competition. Still, asking to be entrusted with the brains of billions of people without any scrutiny is too much.

I believe that standards can help. I certainly do not intend to suggest “standardisation of search engines”. This makes no sense because it would mean stopping progress in an area that has seen so much innovation in the last 20 years. But having a commonly shared understanding of the architecture and subsystem interfaces of search engines could help. The layered architecture in Figure 1 below is very general, but it says that a search engine is not a monolith and relies on interfaces between functions at different layers.


Figure 1 – A layered search engine architecture

Interfaces in the architecture could provide a way for Public Authorities to guide correct behaviours and for search engine users to peek inside their search engines. Much better still would be to have alternative standard ways of communication, such as the one defined by the Publish/Subscribe Application Format.

Collaborative editing

The browser soon became the vehicle to realise, in an effective way, an old idea: creating documents collaboratively. In 1994 Ward Cunningham developed WikiWikiWeb (in Hawaiian wiki means quick) as a shared database to facilitate the exchange of ideas between programmers. It became the first example of a wiki, a website whose pages can be easily edited and linked with a browser by anybody, without any need for familiarity with HTML.

Jimmy Wales and Larry Sanger launched the Wikipedia project in 2001, originally as a wiki-based complementary project for Nupedia, an online encyclopedia project which was not progressing sufficiently fast because it was edited solely by experts. Wikipedia has been an outstanding success. Today (2015) it has articles in about 270 languages and the number of articles written in English is about 5 million. English is the most important language, but only 15% of all articles are written in that language. The next language is German, with about 5% of all articles.

While admiring the success of the idea, I cannot help noting that it is odd that an article about a (living) person can be written by a different person. This leads to odd results, like the Wikipedia article on Leonardo Chiariglione – not written by me – saying wrong or imprecise things about me.

XML

Extensible Markup Language (XML) is another derivation of SGML developed by W3C, if not literally, at least in terms of design principles. An XML element is made up of a start tag, an end tag, and data in between. The start and end tags describe the data within the tags, which is considered the value of the element. Using XML, the first employee in the HTML document above could be represented as

<EMPLOYEE>
<ID>0001</ID>
<NAME>Leonardo Chiariglione</NAME>
<EMAIL>leonardo@cedeo.net</EMAIL>
<PHONE>0039 011 935 04 61</PHONE>
</EMPLOYEE>

The meanings of <EMPLOYEE>, <ID>, <NAME>, <EMAIL> and <PHONE>, obvious to a human reader, still do not convey anything to a computer, unless the computer is properly instructed with a DTD. The combination of an XML document and the accompanying DTD gives the “information representation” part of the corresponding HTML document. It does not say, however, how the information should be presented (displayed), because this is the role played by the style sheet. Style sheets can be written in a number of style languages, such as Cascading Style Sheets (CSS) or the Extensible Stylesheet Language (XSL). A style sheet might specify what a web browser should do to be able to display the document. In natural language:

Convert all <EMPLOYEE> tags to <UL> tags
Convert all </EMPLOYEE> tags to </UL> tags
Convert all <NAME> tags to <LI> tags
Convert all </NAME> tags to </LI> tags
Convert all <EMAIL> tags to <LI> tags
Convert all </EMAIL> tags to </LI> tags
Convert all <PHONE> tags to <LI> tags
Convert all </PHONE> tags to </LI> tags.
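
Expressed as an actual style sheet, the same rules might look like the following minimal XSLT sketch (one of several possible ways to write such a transformation; the handling of <ID>, which the rules above do not mention, is simply a choice made here to keep the output tidy):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- an EMPLOYEE becomes an HTML bulleted list -->
  <xsl:template match="EMPLOYEE">
    <UL><xsl:apply-templates/></UL>
  </xsl:template>
  <!-- NAME, EMAIL and PHONE become items of the list -->
  <xsl:template match="NAME | EMAIL | PHONE">
    <LI><xsl:value-of select="."/></LI>
  </xsl:template>
  <!-- ID is not mapped by the rules above; this sketch simply drops it -->
  <xsl:template match="ID"/>
</xsl:stylesheet>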

Unlike HTML where “information representation” and “information presentation” are bundled together, in XML they are separate. In a sense this separation of “representation” of information from its “presentation” is also a feature of MPEG-1 and MPEG-2, because a decoder interprets the coded representation of audio and video streams, but the way those streams are presented is outside of the standard and part of an external application. In XML the “external application” may very well be an HTML browser, as in the example above.  

On the other hand, MPEG standards do not need the equivalent of the DTD. Indeed the equivalent of this information is shared by the encoder and the decoder because it is part of the standard itself. It could hardly be otherwise, because XML is a very inefficient (in terms of number of bits used to do the job) way of tagging information while video codecs and multiplexers are designed to be bit-thrifty to the extreme. 

The work that eventually produced the W3C XML Recommendation started in 1996 with the idea of defining a markup language with the power and extensibility of SGML but with the simplicity of HTML. Version 1.0 of the XML Recommendation was approved in 1998. The original goals were achieved, at least in terms of number of pages, because the text of the XML Recommendation was only 26 pages, as opposed to the 500+ pages of the SGML standard. Even so, most of the useful things that could be done with SGML could also be done with XML.


W3C has exploited the description capabilities of XML for other non-textual purposes, e.g. the Synchronized Multimedia Integration Language (SMIL). Like in HTML files, a SMIL file begins with a <smil> tag identifying it as a SMIL file, and contains <head> and <body> sections. The <head> section contains information describing the appearance and layout of the presentation, while the <body> section contains the timing and content information. This is the functional equivalent of MPEG-4 Systems composition. MPEG has also used XML to develop a simplified 2D composition standard called Lightweight Application Scene Representation (LASeR).
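
As an illustration, a minimal SMIL presentation playing a video clip and an audio commentary in parallel might look like the following sketch (the file names and the region layout are hypothetical):

<smil>
  <head>
    <layout>
      <!-- a single region in which the video is rendered -->
      <region id="main" width="320" height="240"/>
    </layout>
  </head>
  <body>
    <!-- the two media objects play at the same time -->
    <par>
      <video src="clip.mpg" region="main"/>
      <audio src="commentary.mp3"/>
    </par>
  </body>
</smil>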

XML inherited DTDs from SGML, but it has become apparent that some shortcomings were also inherited: the different syntaxes of XML and DTDs, requiring different parsers; the impossibility of specifying datatypes and data formats that could be used to map automatically to and from programming languages; and the lack of a set of well-known basic elements to choose from.
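
To illustrate, a DTD for the EMPLOYEE fragment shown earlier could be as simple as the sketch below (the content model is illustrative):

<!ELEMENT EMPLOYEE (ID, NAME, EMAIL, PHONE)>
<!ELEMENT ID       (#PCDATA)>
<!ELEMENT NAME     (#PCDATA)>
<!ELEMENT EMAIL    (#PCDATA)>
<!ELEMENT PHONE    (#PCDATA)>

Note that nothing in it says that ID should be a number or PHONE a telephone number: everything is just character data.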

The XML Schema standard addresses these DTD limitations. It provides a method to specify XML documents in XML itself and includes standard pre-defined as well as user-defined data types. The purpose of a schema is to define a class of XML documents by applying particular constructs to constrain their structure. Schemas can be seen as providing additional constraints beyond DTDs, or as a superset of the capabilities of DTDs.
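
For comparison, the same EMPLOYEE structure expressed in XML Schema might look like the following sketch, this time with a datatype attached to each element (the choice of types is illustrative):

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="EMPLOYEE">
    <xs:complexType>
      <xs:sequence>
        <!-- unlike a DTD, a schema can constrain the type of each element -->
        <xs:element name="ID"    type="xs:positiveInteger"/>
        <xs:element name="NAME"  type="xs:string"/>
        <xs:element name="EMAIL" type="xs:string"/>
        <xs:element name="PHONE" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>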


Inside MPEG-7

Before starting this page I must warn the reader that MPEG-7 is more abstract than previous MPEG standards, which may make this page more difficult to read than the others. With this warning, let’s start from some definitions that will hopefully facilitate understanding of the MPEG-7 standard.

Element Definition Examples
Data – The audio-visual information to be described using MPEG-7. Examples: MPEG-4 elementary streams, audio CDs containing music, hard disks containing MP3 files, synthetically generated pictures or drawings on a piece of paper.
Feature – A distinctive characteristic of a Data item that means something to somebody. Examples: the colour of a picture, the particular rhythm of a piece of music, the camera movement in a video or the cast of a movie.
Descriptor – A representation of a Feature. It defines the syntax and semantics of the representation of the Feature. Different Descriptors may very well represent the same Feature. Example: the Feature “colour” can be represented as a histogram or as a frequency spectrum.
Descriptor Value – An instantiation of a Descriptor for a given Data set. Descriptor Values are combined through the Description Scheme mechanism to form a Description.
Description Scheme – The specification of the structure and semantics of the relationships among its components, which can be Descriptors or, recursively, Description Schemes. The distinction between a DS and a D is that a D just contains basic data types and does not make reference to any other D (or, obviously, DS). Example: a movie that is temporally structured in scenes, with textual descriptions at scene level and some audio descriptors of dialogues and background music.

In the following, Ds and DSs are collectively called Description Tools (DT).

The figure below represents the main elements making up the MPEG-7 standard.


Figure 1 – The main MPEG-7 elements

MPEG-7 provides a wide range of low-level descriptors.

MPEG-7 Visual Tools consist of basic structures and Ds that cover the following basic visual features: Color, Texture, Shape, Motion and Localisation. 

The Color feature has multiple Ds. Some of them are: 

Name Descriptor
Color Quantization – Expresses colour histograms while keeping the flexibility of linear and non-linear quantisation and look-up tables.
Dominant Color(s) – Represents features where a small number of colours suffices to characterise the colour information in the region of interest.
Scalable Color – Useful for image-to-image matching and retrieval based on the colour feature.
Color Structure – Captures both colour content (similar to a colour histogram) and information about the structure of this content. Its intended use is still-image retrieval, because its main functionality is image-to-image matching.
Color Layout – Specifies the spatial distribution of colours; it can be used for image-to-image matching and video-clip-to-video-clip matching, or for layout-based colour retrieval, such as sketch-to-image matching.

The Texture feature has 3 Ds. 

Name Descriptor
Homogeneous Texture – Used for searching and browsing through large collections of similar-looking patterns. An image can be considered as a mosaic of homogeneous textures, so that the texture features associated with the regions can be used to index the image data. Agricultural areas and vegetation patches are examples of homogeneous textures commonly found in aerial and satellite imagery.
Texture Browsing – Provides a perceptual characterisation of texture, similar to a human characterisation, in terms of regularity, coarseness and directionality. The computation of this descriptor proceeds similarly to the Homogeneous Texture D. First, the image is filtered with a bank of special filters. From the filtered outputs, two dominant texture orientations are identified. Then the regularity and coarseness are determined by analysing the filtered image projections along the dominant orientations.
Edge Histogram – Represents the spatial distribution of five types of edges: four directional edges and one non-directional edge. The Edge Histogram can retrieve images with similar semantic meaning, since edges play an important role in image perception.

The Shape feature has 4 Ds. Region-Based Shape is a Descriptor that can describe any shape. This is a complex task because the shape of an object may consist of a single region or a set of regions, as well as some holes in the object or several disjoint regions.

The Motion feature has 4 Ds: camera motion, object motion trajectory, parametric object motion, and motion activity. 

Name Descriptor
Camera Motion – Characterises the motion parameters of a camera in 3-D space. This motion parameter information can be extracted automatically or generated by capture devices.
Motion Trajectory – Describes the motion trajectory of an object, defined as the localisation, in time and space, of one representative point of this object. In surveillance, alarms can be triggered if some object has a trajectory identified as dangerous (e.g. passing through a forbidden area, being unusually quick, etc.). In sports, specific actions (e.g. tennis rallies taking place at the net) can be recognised.

MPEG-7 Audio Tools. There are seventeen low-level Audio temporal and spectral Ds that may be used in a variety of applications. While low-level audio Ds in general can serve many conceivable applications, the Spectral Flatness D specifically supports the functionality of robust matching of audio signals. Applications include audio fingerprinting, identification of audio based on a database of known works and, thus, locating metadata for legacy audio content without metadata annotation. 

Four sets of audio Description Technologies – roughly representing application areas – are integrated in the standard: sound recognition, musical instrument timbre, spoken content, and melodic contour. Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.

Name Descriptor
Musical Instrument Timbre – Describes perceptual features of instrument sounds. The aim of the Timbre D is to describe the perceptual features with a reduced set of Ds that relate to notions such as “attack”, “brightness” or “richness” of a sound.
Sound Recognition – Indexes and categorises general sounds, with immediate application to sound effects.
Spoken Content – Describes words spoken within an audio stream. This trades compactness for robustness of search, because current Automatic Speech Recognition (ASR) technologies have their limits, and one will always encounter out-of-vocabulary utterances. To accomplish this, the tools represent the output and what might normally be seen as intermediate ASR results. The tools can be used for two broad classes of retrieval scenario: indexing into and retrieval of an audio stream, and indexing of multimedia objects annotated with speech.

From this cursory presentation one can easily see the range of tools offered to content owners and to application developers. The MPEG-7 Ds have been designed for describing a wide range of information types: low-level audio-visual features such as colour, texture, motion, audio energy, and so forth, as illustrated above; high-level features of semantic objects, events and abstract concepts; content management processes; information about the storage media; and so forth. Most Ds corresponding to low-level features can be extracted automatically, whereas more human intervention is likely to be required for producing higher-level Ds.

The MPEG-7 Multimedia Description Schemes part of the standard defines a set of DTs dealing with generic as well as multimedia entities. Generic entities are features used in audio, visual, and text descriptions, and are therefore “generic” to all media. These are, for instance, “vector”, “time”, etc. More complex DTs are also standardised. They are used whenever more than one medium needs to be described (e.g. audio-visual). These DTs can be grouped into 6 different classes according to their functionality as in the following figure.


Figure 2 – The MPEG-7 Multimedia Description Schemes 

Elements Definition
Basic Elements – Facilitate the creation and packaging of descriptions.
Content description – Represents perceivable information.
Content management – Describes the media features, the creation and the usage of the AV content.
Content organization – Represents the analysis and classification of several AV content items.
Navigation and access – Specifies summaries and variations of the AV content.
User Interaction – Describes user preferences and usage history.

Basic Elements define a number of Schema Tools that facilitate the creation and packaging of MPEG-7 descriptions, a number of basic data types and mathematical structures, such as vectors and matrices, which are important for audio-visual content description. There are also constructs for linking media files and localising segments, regions, and so forth. Many of the basic elements address specific needs of audio-visual content description, such as the description of time, places, persons, individuals, groups, organisations, and other textual annotation. 

Content Description describes the Structure (regions, video frames, and audio segments) and Semantics (objects, events, abstract notions). Structural aspects describe the audio-visual content from the viewpoint of its structure. Conceptual aspects describe the audio-visual content from the viewpoint of real-world semantics and conceptual notions. 

The Content Management DTs allow the description of the life cycle of the content, from creation to consumption, including media coding, storage and file formats, and content usage.

Name Description Tools
Creation Information – Describes the creation and classification of the audio-visual content and other material that is related to the audio-visual content: Title (which may itself be textual or another piece of audio-visual content), Textual Annotation, and Creation Information such as creators, creation locations and dates.
Classification Information – Describes how the audio-visual material is classified into categories such as genre, subject, purpose, language, and so forth. It also provides review and guidance information such as age classification, subjective review, parental guidance, and so forth.
Related Material Information – Describes whether other audio-visual material exists that is related to the content being described.
Usage Information – Describes the usage information related to the audio-visual content, such as usage rights (through links to the rights holders and other information related to rights management and protection), availability, usage record, and financial information.
Media Description – Describes the storage media, such as the compression, coding and storage format of the audio-visual data.
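
To give a flavour of what such descriptions look like when instantiated, the following is a schematic sketch of an MPEG-7 description carrying Creation Information for a video. It is written from memory and heavily simplified, so the element names and types should be taken as indicative rather than as verbatim citations of the MPEG-7 schema, and the title is of course invented:

<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <!-- Creation Information DS: what the content is called -->
        <CreationInformation>
          <Creation>
            <Title>Holiday in Tuscany</Title>
          </Creation>
        </CreationInformation>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>

A search engine or an EPG could then match a user query against the Title (and, in richer descriptions, against creators, dates, classification and so forth) without ever touching the audio-visual data itself.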

Content Organisation organises and models collections of audio-visual content and of descriptions. 

Navigation and Access facilitates browsing and retrieval of audio-visual content by defining summaries, partitions, decompositions, and variations of the audio-visual material. 

Name Description Tools
Summaries – Provide compact summaries of the audio-visual content to enable discovery, browsing, navigation, visualization and sonification of audio-visual content.
Partitions and Decompositions – Describe different decompositions of the audio-visual signals in space, time and frequency.
Variations – Provide information about different variations of audio-visual programs, such as summaries and abstracts; scaled, compressed and low-resolution versions; and versions with different languages and modalities – audio, video, image, text, and so forth.

User Interaction describes user preferences and usage history pertaining to the consumption of the multimedia material. This allows, for example, the matching between user preferences and MPEG-7 content descriptions in order to facilitate personalisation of audio-visual content access, presentation and consumption. 

The main tools used to implement MPEG-7 descriptions are DDL, DSs, and Ds. Ds bind a feature to a set of values. DSs are models of the multimedia objects and of the universes that they represent. They specify the types of the Ds that can be used in a given description, and the relationships between these Ds or between other DSs. The DDL provides the descriptive foundation by which users can create their own DSs and Ds and defines the syntactic rules to express and combine DSs and Ds. 

The Description Definition Language can express spatial, temporal, structural and conceptual relationships between the elements of a DS, and between DSs. It provides a rich model for links and references between one or more descriptions and the data that they describe. The DDL parser is also capable of validating Description Schemes (content and structure) and D data types, both primitive (integer, text, date, time) and composite (histograms, enumerated types). MPEG-7 adopted the XML Schema Language as the DDL, but added certain extensions to satisfy all requirements. The DDL can be broken down into the following logical normative components: the XML Schema structural language components, the XML Schema datatype language components and the MPEG-7-specific extensions.

The information representation specified in the MPEG-7 standard provides the means to represent coded multimedia content description information. The entity that makes use of such coded representation of the multimedia content is an “MPEG-7 terminal”. This may be a standalone application or a part of an application system. The architecture of such a terminal is depicted in the figure below. 


Figure 3 – Model of an MPEG-7 terminal

The Delivery layer, placed at the bottom of the figure, provides MPEG-7 elementary streams to the Systems layer. MPEG-7 elementary streams consist of consecutive, individually accessible portions of data named Access Units. An Access Unit (AU) is the smallest data entity to which timing information can be attributed. MPEG-7 elementary streams contain Schema information, which defines the structure of the MPEG-7 description, and Description information. The latter can be either the complete description of the multimedia content or fragments of the description.

MPEG-7 data can be represented in textual format, in binary format, or in a mixture of the two, depending on application requirements. The standard defines a unique mapping between the binary format and the textual format. A bi-directional lossless mapping between the textual representation and the binary representation is possible, but it need not always be used. Some applications may not want to transmit all the information contained in the textual representation and may prefer to use a more bit-efficient, lossy binary transmission. The syntax of the textual format is defined by the DDL; the syntax of the binary format, called Binary Format for MPEG-7 data (BiM), was originally defined in Part 1 (Systems) of the standard but was later moved to Part 1 of MPEG-B.

At the Compression layer, the flow of AUs (either textual or binary) is parsed and the content description is reconstructed. An MPEG-7 binary stream can either be parsed by the BiM parser, transformed into textual format and then passed on in textual format for further reconstruction processing, or be parsed by the BiM parser and then passed on in a proprietary format for further processing.

AUs are further structured as commands encapsulating the schema or the description information. Commands allow a description to be delivered in a single chunk or to be fragmented in small pieces. They allow basic operations such as updating a D, deleting part of the description or adding a new DDL structure. The reconstruction stage of the compression layer updates the description information and associated schema information by consuming these commands. Further structure of the schema or description is out of the scope of the MPEG-7 standard.


MPEG-7 Development

The Life in ISO page tells of my efforts to overcome the many obstacles on the road to establishing Subcommittee 29 (SC 29) – obstacles due to the “lupissimus” nature of the organisations operating under the ISO fold – and to make it the video, audio, multimedia and hypermedia coding SC. I did this because I felt the need for a group at a higher level than a WG, with the task of planning, organising and shielding – yes, again – the work of the technical groups against the wolves ambulating in circles. I had great hopes that SC 29 would deal with these issues with the same aggressive – in a positive sense, of course – spirit as MPEG.

In 1993, two years after its establishment, I began to inquire about the future of SC 29. The SC 29 meeting in Seoul, held on the three days following the MPEG meeting, provided the opportunity to make a proposal, triggered by the growing excitement that the television industry of that time, especially the cable television industry, was feeling because MPEG-2 – with its ability to put more than 5 times as many programs in their analogue transmission channels – was becoming real. My reasoning was: if people were going to be confronted with an offering of 500 simultaneous programs, and if the indexing of programs remained the same as that used with a few programs, then how on earth would consumers ever become aware of what was on offer?

At the Seoul meeting, the Italian National Body made a number of proposals. One of them read:

the success of future multimedia services/applications can be assured only if a scalable man/service interface is defined that is capable to deal with the simpler services/applications of today and be easily upgraded to deal with the sophisticated multimedia ones in the future. 

At that time I had no intention of getting MPEG into this kind of work – the group had enough to chew on with MPEG-2 and MPEG-4 – so I proposed that MHEG make a study on the subject and report back at the following meeting one year later (I know, this does not look very much “internet age”, but that was 1993). With my engagement in DAVIC starting immediately after that meeting, I hope I can be at least understood, if not excused, for having no time to follow the work that MHEG was expected to do during 1994. That was a mistake, because at the Singapore meeting one year later the MHEG chair reported that the study had not yet been made. So the group was given another year. After the Singapore meeting my life became, if possible, even more hectic and the matter then fell into oblivion.

With my departure from DAVIC in December 1995, I had enough time to rethink the role of MPEG and at the Florence meeting in March 1996 I revisited my 30-month-old proposal and raised it within MPEG, of course augmented with the experience gained during that time span. At first the proposal did not go down very well with the members. At that time the MPEG community was still massively populated with compression experts and the idea of dealing with “search interfaces” was too alien to them. With perseverance, however, a core group of people understanding and supporting the new project began to form, and by the Chicago meeting in September/October 1996 the group had come to the conclusion that this was indeed something MPEG had to do.

Unlike what MPEG had done until that time, the next work item would deal with the “description” of audio and video. Description is nothing other than another form of information representation, and SGML and XML already provide the means to represent textual descriptions. The real difference was that, in the standards that MPEG had developed or was planning to develop, information was represented for the purpose of feeding it to human beings, and it was natural that the fidelity criterion for the representation should be based on a measure of “original vs. decompressed” distortion. A movie is still seen and heard by human eyes and ears, and therefore must look and sound “good”, Signal to Noise Ratio (SNR) in its many forms being one measure somehow related to how similar a particular piece of content looks to the original and hence how “good” it is (from the technical quality viewpoint). On the other hand, “descriptions” were meant for use by machines that could only process information in a form they understood, with the ultimate goal of showing something to the end user that might be only loosely related to the descriptions. In the process, the original emphasis on “interfaces” was – rightly – lost and the core technology became the “digital representation of descriptors”. The title of the standard, however, is “Multimedia Content Description Interface” and still makes reference to interfaces.

Finding the “what” had been – so to speak – easy, even though it had taken some time to get there. Finding the MPEG “number” to identify it was not equally so. The numbers of previous MPEG standards had been assigned in a sequential logic that a combination of technology, politics and other factors had disrupted. One school of thought leaned toward MPEG-8, the obvious extension of the sequence of powers of 2, while others wanted to resume the normal order and call it MPEG-5. I personally favoured the former, but one morning Fernando Pereira, who had been my Ph.D. student at CSELT in the late 1980s and had been attending MPEG since the official start of MPEG-4 in late 1993, called me and proposed MPEG-7. Those in MPEG who knew him well will realise that this proposal was perfectly in line with the “little rascal” attitude that he has always taken towards this kind of thing. So the number stuck.

The year 1997 was largely spent working out the requirements, going through the usual steps of identifying applications. One major application example of MPEG-7 is multimedia information retrieval systems, i.e. systems that allow quick and efficient search of various types of multimedia information of interest to a user in, say, a digital library. When the data themselves are stored, the descriptions – the metadata – would also be stored in such a way that searches could be carried out.


Figure 1 – Search using MPEG-7

Another example is filtering a stream of audiovisual content descriptions, the famous “500-channel” use case in which the service provider would send, next to the TV programs, appropriate descriptions of them. A user could instruct his set-top box or recording device to search for the programs that would satisfy his preferences. In this way the device could scan for, say, all types of news with the exclusion of sports.

A third example comes from applications based on image understanding techniques, such as those found in surveillance, intelligent vision, smart cameras, etc. A sound sensor would produce not sound samples but, say, loudness and spectral data, and an image sensor might produce visual data not in the form of pixels but in the form of objects, e.g. physical measures of the object and time information. These could then be processed to verify whether certain conditions were met and, in the positive case, an alarm bell would ring.

The last example in this list is media conversion, such as one that would be used at an infopoint. Depending on user queries and the type of user device, the system would generate responses by converting elements of information from one type to another using the semantic meaning of each object. 

The examples given confirm the original assumptions: in MPEG-7, audio and visual information would be created, exchanged, retrieved and re-used by computational systems where humans would not necessarily deal directly with descriptors, but only consume the result of the processing effected by an application. Humans would probably make queries in manifold forms, such as by using natural language, by hand-drawing images, by selecting models or by humming. The computer would then convert the queries into some internal form that would be used to execute whatever processing, based on MPEG-7 data, was needed to provide an answer. But both the input and output processing would be outside of the standard, as MPEG is only concerned with “interfaces”.

Here are some examples of how queries could be formulated for different types of media.

Media Type Query Type
Images – Draw colour and texture regions and search for similar regions
Graphics – Sketch graphics and search for images with similar graphics
Object Motion – Describe the motion of objects and search for scenes with similarly moving objects
Video – Describe actions and retrieve videos with similar actions
Voice – Use an excerpt of a song to retrieve similar songs/video clips
Music – Play a few notes and search for a list of musical pieces containing them

The first view of MPEG-7 was very much driven by the signal processing background of most MPEG members and addressed standard descriptions of some low-level audio and video features. Video people would think of shape, size, texture, colour, movement (trajectory), position, etc. Audio people would think of key, mood, tempo, tempo changes, position in sound space, etc. But what about a query of the type: “a scene with a barking brown dog on the left and a blue ball that falls down on the right, with the sound of passing cars in the background”? To provide an answer, it was clear that the system had to make use not only of low-level descriptors, but also of higher-level information, as depicted in Figure 2.


Fig. 2 – Information levels in MPEG-7

Examples of information elements for each level are given in Tab. 1.

Table 1 – Information elements for each level

Data: Images, Video, Audio, Multimedia Formats
Signal Structure: Regions, Segments, Grids, Mosaics, Relationships (space-time), Layout
Features: Color, Texture, Shape, Motion, Speech, Timbre, Melody
Model: Clusters, Classes, Collections, Probabilities, Confidences
Semantics: Objects, Events, Actions, Video, People, Relationships

This recognition triggered a large-scale invasion of IT-world experts into MPEG, in addition to a smaller-scale invasion of experts from another field, metadata, a term that has become fashionable and means data about data, including data that assist in identifying, describing, evaluating and selecting the information elements that the metadata describe. At the time MPEG-7 was starting, some metadata “initiatives” were already under way or starting, such as Dublin Core, P/META, SMPTE/EBU, etc. All of these, however, were sectorial, in the sense that they were driven by a specific industry.

While the requirement work was progressing, MPEG started an action to acquire audio and visual data that could be used to test submissions in the competitive phase and to conduct core experiments in the subsequent collaborative phase. The MPEG-7 content set, as assembled at the Atlantic City meeting in October 1998, is composed of three main categories: Audio (about 12 hours), Still images (about 7000 images, 3000 of which were trademark images from the Korean Industrial Property Office) and Video (about 13 hours). Other types of content were assembled in subsequent phases. The MPEG-7 content set also offered the opportunity for MPEG to have direct experience of “licensing agreements” that right holders made for the use of audio and visual test material. 

Licensing terms for content used in the development of MPEG standards had varied considerably. The CCIR video sequences, obtained through the good offices of Ken Davies, were released “for development of MPEG standards”. The MPEG-1 Audio sequences were extracted from the so-called EBU SQAM CD and were subject to similar conditions. Most of the MPEG-2 Video sequences – high-quality video clips at TV resolution – had stricter usage conditions. Also most MPEG-4 sequences had well-defined conditions. For the first time, use of the MPEG-7 content set was regulated by an agreement drafted by a lawyer that had to be signed by those who wanted to use that content.

The well-tested method of producing a CfP to acquire relevant technology was also used this time. The MPEG-7 CfP was issued in October 1998 at the Atlantic City meeting and the University of Lancaster, via the good offices of Ed Hartley, the UK HoD at that time, kindly hosted the meeting of the “MPEG-7 Test and Evaluation” AHG in February 1999. The total number of proposals, submitted by some 60 parties, was almost 400 and the number of participants was almost as large as that of a regular MPEG meeting.

After an intense week of work, during which fair expert comparisons between submissions were made – without subjective tests for the first time, in keeping with the “computer-processable data” paradigm – the technical work could start.

It took some time to give a clean structure – no longer precisely one and trine this time – to the MPEG-7 standard. Eventually it was agreed that there would be a standardised core set of Descriptors (D) that could be used to describe the various features of multimedia content. Purely Visual and purely Audio Descriptors would be included in the Visual and Audio parts of the standard, respectively. Then pre-defined structures of Descriptors and their relationships, called Description Schemes (DS) would be included in the Multimedia Description Schemes part of the standard. This part would also contain Ds that were neither Visual nor Audio. DSs would need a language – called Description Definition Language (DDL) – so that DSs and, possibly, new Ds could be defined. Lastly there would be a bit-efficient coded representation of descriptions to enable efficient storage and fast access, along with a “Systems” part of the standard, that would also provide the glue holding together all the pieces. 


Figure 3 – Descriptors, Description Schemes and Description Definition Language in MPEG-7

Work progressed smoothly until the July 1999 meeting in Vancouver. At that time it was realised that the organisation of the group – based on Requirements, DMIF, Systems, Video, Audio, SNHC, Test, Implementation Studies and Liaison, and unchanged since July 1996 – was no longer a good match for the new context. One viewpoint was that DSs could be considered an extension of the MPEG-4 multimedia composition work and that responsibility for them could be taken over by the Systems group. On the other hand, during all of the MPEG-4 development there had been a hiatus, not really justified by technical reasons, between the Systems and DMIF work, which had led to the publication of two different parts of the standard, while they could have been combined or published more rationally with a different split of content. If this rift were to be mended, the merger of the Systems and DMIF groups would be justified.

After a very long discussion at the Chairs meeting – helped by the long summer days of Vancouver – it was concluded that the best arrangement would be to bring the Systems and DMIF work together and to establish a new group called Multimedia Description Schemes (MDS). Philippe Salembier of the Polytechnic University of Catalonia was appointed as chairman of that group. At the same meeting the DSM group, one of the earliest MPEG groups and the originator of the DSM-CC and DMIF standards, was disbanded after 10 years of service.

Several solutions were available when the DDL development started but none of them was considered stable enough. The initial decision made by MPEG was to develop its own language whilst keeping track of W3C’s XML Schema developments. The idea was to use the work done in W3C if ready, but to have a fallback solution in case the W3C work was delayed. In April 2000 the improved stability of XML Schema Language, its expected widespread adoption, the availability of tools and parsers and its ability to satisfy the majority of MPEG-7’s requirements, led to the decision to adopt XML Schema as the basis for the DDL. However, because XML Schema was not designed specifically for audiovisual content, certain specific MPEG-7 extensions had to be developed by MPEG.

MPEG-7 obviously needed its own reference software, called the eXperimentation Model (XM) software, and Stefan Herrmann of the University of Munich was appointed to oversee its development. The XM is the simulation platform for the MPEG-7 Ds, DSs, Coding Schemes (CS) and DDL. MPEG-7 followed the path opened by MPEG-4 for reference software, with some adjustments. Proponents did not have to provide reference software if the rest of the code utilised an API documented and supported by a number of independent vendors, if standard mathematical routines such as MATLAB were involved, or if it was possible to utilise publicly available code (source code only) without any modification. Also, as was already the case for MPEG-4, there was no obligation to provide very specific feature extraction tools.

The MPEG-7 standard was approved at the Sydney meeting in July 2001, the longest-ever MPEG meeting, which approached 24:00. At that time the standard included the five parts Systems, DDL, Visual, Audio and MDS. Very soon Reference Software (Part 6) and Conformance (Part 7) were published. Other MPEG-7 parts are Extraction and Use of MPEG-7 Descriptions, Profiles, Schema Definition, Profile Schemas, Query Format and Compact Descriptors for Visual Search. More about the last two later.


Machines Begin To Understand The World

The increasing dependency of many users on their devices, particularly mobile devices, has initiated an unstoppable drive – supported by the devices’ increasing interaction and processing capability – to pack more functionalities into them. In some cases the functionality is entirely confined within the device, but in other cases the ability to interact with other devices or services depends on an agreed – i.e. standard – communication interface.

Users can already send the audio signature of a song captured from the air to a service and receive back all sorts of information regarding the song. This is something that can already be implemented in a standard fashion using the MPEG-7 Audio Descriptors.

In the “image” domain MPEG has made progress compared to the “old” MPEG-7 Visual capabilities. This has been achieved with two amendments of MPEG-7 Visual: image signature tools and video signature tools. These descriptors provide a “fingerprint” that uniquely identifies image and video content. They are robust, in the sense that their value is not affected by a wide range of common editing operations. However, they are also sufficiently different for every item of “original” content to allow unique and reliable identification of the image or the video.

Image Signature is a content-based descriptor designed for the fast and robust identification of the same or modified image on the web-scale or in databases. Also known as fingerprint, Image Signature has a strong advantage over watermarking techniques in that it does not require any modification of the content and can be used readily with all existing content. Image Signature combines two complementary approaches in image representation:

  1. A global signature, where the signature is extracted from the entire image
  2. A local approach, where a set of local signatures are extracted at salient points in the image.

Image Signature has been tested against a wide range of common modifications, such as text/logo overlay, rotation, cropping, colour changes, etc., achieving an overall success rate of ~99.29% at a false alarm rate of less than 0.05 parts-per-million for the global signature, and ~98.04% at a false alarm rate of less than 10 parts-per-million for the complete signature. Search speed is in the order of 80 million and 100,000 matches per second for the global and complete signatures respectively. The Image Signature is also extremely compact, as it requires only 1024 bits per image for the global signature and up to 7424 bits for the complete signature.
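
The matching step for such binary signatures can be sketched in a few lines. The sketch below is a generic illustration of fingerprint matching by Hamming distance, with a threshold of my own choosing; it is not the matching procedure defined in the MPEG-7 Visual amendment.

  # Generic sketch of matching 1024-bit global signatures by Hamming distance.
  # The 10% threshold is a placeholder, not a value from the standard.
  import random

  def hamming(sig_a: int, sig_b: int) -> int:
      # Signatures are held as Python integers acting as bit strings.
      return bin(sig_a ^ sig_b).count("1")

  def same_image(sig_a: int, sig_b: int, n_bits: int = 1024) -> bool:
      return hamming(sig_a, sig_b) < 0.1 * n_bits

  random.seed(0)
  original = random.getrandbits(1024)
  modified = original ^ (1 << 5) ^ (1 << 700)   # a lightly edited copy
  print(same_image(original, modified))          # True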

Video Signature enjoys similar features to Image Signature. Key technical aspects are a combined dense (video-frame-level) and sparse (video-segment-level) description approach, allowing flexible multi-stage matching schemes, and a custom descriptor compression scheme to facilitate efficient storage and transmission of the Video Signature metadata. Video Signature has been tested against a wide range of commonly performed modifications, e.g. text/logo overlay, camera capture (camcording), compression at low bitrates, resolution reduction, frame rate changes, etc., achieving an overall success rate of ~95.49% at a false alarm rate of less than 5 parts-per-million. Video Signature allows for very high extraction and matching speeds, and very low storage and transmission requirements, at only ~2 MB per hour of video content.
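
The combined dense/sparse description lends itself to the multi-stage matching mentioned above: compact segment-level signatures prune the search, frame-level signatures confirm a match. The following sketch only illustrates that idea in generic terms; the distance measures and thresholds are placeholders and do not come from the standard.

  # Coarse-to-fine matching sketch: segment-level signatures select candidates,
  # frame-level signatures verify them. Thresholds are illustrative placeholders.
  def hamming(a: int, b: int) -> int:
      return bin(a ^ b).count("1")

  def match_video(query_segments, query_frames, db):
      """db maps video_id -> (segment_signatures, frame_signatures)."""
      candidates = []
      for video_id, (segments, frames) in db.items():
          # Stage 1 (sparse): cheap comparison of segment-level signatures.
          if min(hamming(q, s) for q in query_segments for s in segments) < 8:
              candidates.append((video_id, frames))
      matches = []
      for video_id, frames in candidates:
          # Stage 2 (dense): verify the candidate with frame-level signatures.
          score = sum(min(hamming(q, f) for f in frames) for q in query_frames)
          if score / len(query_frames) < 4:
              matches.append(video_id)
      return matches

  db = {"clip-A": ([0b0011, 0b0111], [0b0011, 0b0101, 0b1001])}
  print(match_video([0b0011], [0b0011, 0b0101], db))   # ['clip-A']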

Another important example is provided by “Visual Search”, which is usefully represented by the following use case. A user who wants to know more about an object takes a picture of it and sends it to a service. The service interprets the object, uses this interpretation to search a knowledge database and possibly responds with a number of additional information elements from a variety of viewpoints.

MPEG-7 already provides a number of descriptors for visual, audio and multimedia content that are useful for this task. However, more is needed to solve the challenging problem of enabling users to obtain the desired information by using their mobile and non-mobile devices to take a picture, extract sophisticated descriptors from the picture without depleting the battery, and send the data to the service provider of choice without clogging the network. This is the task of the Compact Descriptors for Visual Search standard (MPEG-7 Part 13).
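
The division of labour implied by the use case – extract a compact descriptor on the device and send only that to the service – can be sketched as follows. This is not the CDVS pipeline itself: ORB keypoints from OpenCV stand in for the descriptors actually specified in Part 13, and the service URL is hypothetical.

  # Client side of a visual-search query: extract a compact descriptor locally
  # and ship only the descriptor, not the picture. ORB is a stand-in for the
  # CDVS descriptors; the URL is made up.
  import cv2
  import numpy as np
  import requests

  def query_visual_search(image_path: str, service_url: str):
      image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
      orb = cv2.ORB_create(nfeatures=300)            # keep the payload small
      keypoints, descriptors = orb.detectAndCompute(image, None)
      payload = descriptors.astype(np.uint8).tobytes()
      # Only a few kilobytes cross the network instead of the full picture.
      return requests.post(service_url, data=payload,
                           headers={"Content-Type": "application/octet-stream"})

  # response = query_visual_search("object.jpg", "https://example.org/visual-search")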


The Impact Of MPEG-7

Metadata are an important ingredient of any business strategy in the media field, both at the industry and at the individual company level. At the industry level this can be seen from the large number of initiatives that have developed or are still developing metadata standards, in the hope that metadata serving the interests of an industry will prop up that industry’s business. At the company level this can be seen from the attempt of many pay TV Service Providers (SP) to offer access to their content through proprietary Electronic Program Guide (EPG) technologies. The other main application domain, streaming video on the Internet, does show that meaningful business models exist, to the extent that many service providers feel compelled to have a presence in that domain. While there is hope that such sophistications as metadata can finally be put in the picture, there is more than a possibility that the internet will become little more than a place offering services à la CATV, with minimal variations of business model.

This may explain why there are still few cases of significant use of MPEG-7 in the marketplace. The pay TV model forces every provider to make on its own all the investments needed for effective access to AV information. As a result, every SP offers only very primitive forms of content access. On the other hand, it is true that, in spite of the confusion surrounding the word “convergence”, more content reaches end users through multiple channels, each of which used to be controlled by a single industry. Content Providers who want to exploit multichannel distribution opportunities for their content will need metadata that are truly generic, and there is no other metadata standard but MPEG-7 with this feature.

The TV Anytime Forum (TVA) is an organisation MPEG has dealt with in the latest phases of MPEG-7. TVA was one splinter group of the latter-day DAVIC work, when the management of that body had proposed the continuation of DAVIC beyond the 5-year date with a program of work centred on two projects: “TV Anytime” and “TV Anywhere”. The DAVIC membership did not support the continuation of the organisation but the idea of “TV Anytime” continued as a loose organisation of interested companies. The original idea of TV Anytime was that of a TV receiving device, like some products currently available, that has storage capacity and therefore could extend the use of the broadcasting channel by letting the receiver “pick” programs of interest to the user when they are actually transmitted so that the user could watch those programs when he had time for it. To achieve its goal TVA adopted a significant set of MPEG-7 technologies in its specifications.

At the completion of version 1 of MPEG-7 in July 2001 there was talk of setting up an MPEG-7 Alliance (MP7A), an organisation with a scope similar to M4IF. Eventually it was found more convenient to extend the scope of M4IF to host MPEG related interests. The proposal was accepted at the June 2003 M4IF General Assembly. The organisation continued with the new name of MPEGIF, but it merged with the Open IPTV Forum (OIPF) in 2013. OIPF itself merged with the HbbTV Association.


A World Of Peers

My sudden, although not unplanned, departure from DAVIC gave me the opportunity to realise some ideas I had been mulling over for some time. With MPEG and DAVIC I had dealt with machines that process data in a sophisticated, well-programmed way, but as intermediaries between humans or, at most, between humans and machines. What if the machines themselves processed data (à la MPEG-7) and exchanged them autonomously with other machines – of course within the purview set by humans? That was a bold undertaking that would possibly involve new industries beyond those I had become accustomed to working with, but I thought I had developed a good recipe for creating technical standards enabling new opportunities from unexpected directions.

A few days after the conclusion of the DAVIC meeting in December 1995 in Berlin I was already deep in Christmas holidays and had advanced quite substantially in the development of the idea. In a letter to some friends at the beginning of January 1996 I disclosed the first ideas of the mission of a new organisation that I called Foundation for Intelligent Physical Agents (FIPA): 

Important new domains of economic activity that can sustain growth in the 1990s and beyond can be created by assembling technologies from different industries, as in the case of multimedia. Intelligent Physical Agents, i.e. devices for the mass market, capable of executing physical actions under the instruction of human beings, with a high degree of intelligence, will be possible if existing technologies are drawn from the appropriate industries and integrated. The integration task will be executed by an organisation, operating under similar rules as MPEG and DAVIC, where interested parties can jointly specify generic subsystems and the supporting generic technologies.

In the period between January and March 1996, documents were circulated to a selected number of individuals who contributed to the refinement of the basic ideas. Gradually the discussion focused on the idea of “agents”, entities that reside in environments where they interpret “sensor” data that reflect events in the environment, process them and execute “motor” commands that produce effects in the environment.

The notion of “agents”, namely software entities with special characteristics, had been explored by the academic community for a number of years. Fig. 1, drawn by Abe Mamdani of Imperial College, gives a pictorial representation of this concept. 

agent_model

Figure 1 – A model for agents

The letter A represents an agent, an entity that can interact with the environment, constituted of software (possibly interfaced with hardware), other agents and humans. The different sources of information are “fused” and processed, possibly to further interact with the environment.

Agents are characterised by a number of features (a toy control loop combining some of them is sketched after the list):

  • Autonomous: able to operate without the direct intervention of humans
  • Social: able to interact with other agents and/or humans
  • Reactive: able to perceive the environment and respond to changes occurring in it in a timely fashion
  • Pro-active: able to exhibit goal-directed behaviour by taking the initiative
  • Mobile: able to move from one environment to another
  • Time-continuous: able to run processes continuously, not “one-shot” computations that terminate
  • Adaptive: able to adapt automatically to changes in their environment
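
As announced above, here is a toy control loop that combines the reactive and pro-active features listed above; it is purely illustrative and is not derived from any FIPA specification.

  # Toy agent loop: "reactive" (respond to percepts) plus "pro-active"
  # (pursue goals on its own initiative). Illustrative only.
  import time

  class ToyAgent:
      def __init__(self, goals):
          self.goals = list(goals)

      def perceive(self):
          # Stand-in for reading sensor data from the environment.
          return {"timestamp": time.time()}

      def react(self, percept):
          print("reacting to percept at", percept["timestamp"])

      def pursue_goals(self):
          if self.goals:
              print("taking the initiative on goal:", self.goals.pop(0))

      def run(self, cycles=3):
          # Time-continuous in spirit; bounded here so the example terminates.
          for _ in range(cycles):
              self.react(self.perceive())
              self.pursue_goals()
              time.sleep(0.1)

  ToyAgent(["find music", "plan route"]).run()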

Richard Nicol, with whom I had worked for 8 years in the COST 211 project, where he was chairman of the Hardware Subgroup, in the H.100 series of CCITT Recommendations and in the first phases of the Okubo group, was at that time the deputy director of research at British Telecom Laboratories (BTL). In his organisation he had set up a sizeable research group working on agents, and he responded favourably to my proposal to host the first meeting. This was held at Imperial College in London in April 1996 and was attended by 35 people from 7 countries.

The meeting agreed that it was desirable to establish an international organisation tasked to develop generic agent standards based on the participation of all players in the field, developed the basic principles of generic agent standardisation, and agreed on a work plan calling for a first specification of a set of agent technologies to be achieved by the end of 1997.

IBM hosted the second meeting at their TJ Watson Research Center in Yorktown, NY. The meeting was attended by 60 people and organised as a workshop with some 40 presentations dealing with the different areas of “Input/Output primitives”, “Agents”, “Communication” and “Society of Agents”. The meeting output was a framework document for FIPA activities, preliminary drafts of FIPA applications and requirements, and a list of agent technologies that were candidates for a 1997 specification. 

As a result of one of the Yorktown resolutions, in September 1996 five individuals met in Geneva in the office of Me Jean-Pierre Jacquemoud, the same who had handled the papers for the registration of DAVIC, to sign the statutes establishing FIPA as a not-for-profit association under Swiss Civil Law, much as had been done for DAVIC and would be done later for M4IF. According to the work plan, the third meeting was held in Tokyo in October and hosted by NHK. This was expected to produce a Call for Proposals and the fact that this did indeed happen in the short span of 5 work days (and nights) was a remarkable achievement. 

The submissions received from attendees were used to identify twelve different application areas of interest, out of which 4 were retained: Personal Assistant, Personal Travel Assistance, Audio-Visual Entertainment and Broadcasting, and Network Provisioning and Management. From the applications the list of agent technologies needed to support the four applications was identified. Then the list of technologies was restricted to those for which specification in a year’s time was considered feasible. Finally, the general shape of applications made possible by FIPA 97 was described. At the end of this long marathon, the CfP was drafted, approved and released. 

All this had been interesting, and personally rewarding because the same spirit that drives the work of MPEG had been recreated in an environment with people from very different backgrounds and with almost no overlap with MPEG and DAVIC. But that was just the preparation for the big match at the January 1997 meeting, hosted by CSELT in Turin. The 19 submissions received provided sufficient material to start the work, in particular it was possible to select a proposal for Agent Communication Language (ACL), but a second CfP was issued because some more necessary technologies were identified. 

During a set of exciting one-week long meetings held at quarterly intervals and at alternating international locations, FIPA succeeded in developing the first series of specifications – called FIPA 97 – and ratifying them in October 1997 at the Munich meeting hosted by Siemens. 

The achievement of this first milestone had immediate responses from industry. Several companies began developing FIPA 97-conforming platforms or adapting their previous development to the FIPA 97 specification. Some of the platforms have been made open source and have acquired a large number of followers and users. At the January 1999 meeting in Seoul, FIPA conducted the first interoperability trials. Four companies, one of them being CSELT, with different hardware, operating systems, ORBs and languages, successfully tested interoperability and compatibility of their agent platforms by connecting them in a single LAN. 

The following year another set of specifications was developed and approved as FIPA 98 in October at the Durham, NC meeting hosted by IBM.

One year later, at the October 1999 meeting in Kawasaki hosted by Hitachi, I was again confronted with a new concentration of commitments, particularly coming from SDMI, that demanded a large share of my time. I decided not to stand for re-election to the Board of Directors.

A major rewrite of the past FIPA specifications was carried out and approved as FIPA 2000. A new specification called Abstract Architecture, started in April 1999, was also developed; the rewriting of FIPA 97 as FIPA 2000 is a “reification” of that architecture.

I would like to spend a few words describing the content of the FIPA standard and how this can benefit the wide range of communities that were the original FIPA target. FIPA is essentially a communication standard that specifies how agents and Agent Platforms (AP) communicate between themselves. To achieve this there is a need to specify a model of an AP. In Fig. 2 one sees that an AP is composed of 4 entities:

  1. one or more agents
  2. an Agent Management System (AMS)
  3. one or more Directory Facilitators (DF)
  4. one or more Message Transport Systems (MTS).

communicating_agents

Fig. 2 – Communicating agents and agent platforms

One should not confuse an AP with a single computing environment: it may very well be that two or more APs reside on the same computer or that the AP is distributed on a number of computers. 

According to FIPA, an agent is an active software entity designed to perform certain functions, e.g. interacting with the environment and communicating with other agents via the ACL. The AMS is the agent that supervises access to and use of AP resources, maintains a directory of resident agents (so-called white pages) associating logical agent identifiers to state and address information, and handles the agents’ life cycle. The DF is an agent providing yellow page services with matchmaking capabilities. 

So the major elements of the FIPA standard are three specifications:

  1. AP specification
  2. MTS (how APs communicate with each other)
  3. ACL (how agents exchange messages; a minimal sketch of such a message follows the list).
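
The message sketch announced in the list can be as simple as a data structure carrying the fields of an ACL message. The field names below follow the general shape of FIPA ACL (performative, sender, receiver, content, ontology, conversation id), but this is an illustration, not the normative encoding, and the agent names and content expressions are made up.

  # Minimal sketch of an ACL-style message; not the normative FIPA encoding.
  from dataclasses import dataclass, field
  import uuid

  @dataclass
  class AclMessage:
      performative: str            # e.g. "request", "inform", "refuse"
      sender: str
      receiver: str
      content: str
      language: str = "text"
      ontology: str = "demo-ontology"
      conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

  request = AclMessage(performative="request", sender="pda@platform-A",
                       receiver="travel-agent@platform-B",
                       content="(book-flight :from TRN :to LHR)")
  reply = AclMessage(performative="inform", sender=request.receiver,
                     receiver=request.sender, content="(booked :ref ABC123)",
                     conversation_id=request.conversation_id)
  print(request.performative, "->", reply.performative)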

My group at CSELT put considerable effort into the implementation of the FIPA specifications. The most visible result of this effort, driven by Fabio Bellifemine, a key figure in FIPA since the early meetings, has been the development of the Java Agent Development Environment (JADE), an open source project released under the Lesser General Public License (LGPL) with a large community of users spanning the entire spectrum of communication industries, academia included. Started as an internal CSELT project, JADE was significantly enhanced thanks to the CEC-funded IST project LEAP. As a result of a huge collaborative effort, this project has provided a version of JADE suitable for memory-constrained devices such as the new generation of cellular phones.

JADE can be utilised in all environments where complex tasks, whose execution depends on a large number of unpredictable events, have to be carried out. This may be the case in a large factory where there is the need to optimise the delivery of spare parts to warehouses and from there to the assembly lines, and the dispatching of finished products to warehouses and from there to customers.

Another example is offered by the coordination of a large work force scattered over a geographical area, as found in utilities. In Figure 3:

  1. Customers report intervention needs to a call centre
  2. A team manager supervises the work force elements (field engineers)
  3. Agents located on mobile devices retrieve jobs and trade jobs and shifts, but also receive guidance from a route planning system and retrieve documentation
  4. The team manager’s role becomes applying policies and checking progress, instead of the tedious task of allocating jobs.

work_force_coordination

Fig. 3 – Work force coordination

What JADE enables is a straightforward implementation of Peer-to-Peer (P2P) communication. The client-server communication paradigm assumes that intelligence and resources are concentrated in the server, which is just reactive to clients’ requests. Clients have little intelligence and few resources but take the initiative to request services from the server. The P2P paradigm reverses client-server by making each peer capable of playing both roles.
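
A minimal sketch of this dual role – every node both listens for requests and issues them – can be written with plain TCP sockets. It is a generic illustration of the peer-to-peer idea, not of JADE's actual transport, and the ports and messages are arbitrary.

  # A peer that is both server and client; nothing JADE-specific, just the
  # dual role that distinguishes P2P from client-server.
  import socket, threading, time

  class Peer:
      def __init__(self, port):
          self.port = port

      def serve(self):
          # Server role: answer one request from another peer.
          srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          srv.bind(("127.0.0.1", self.port))
          srv.listen()
          conn, _ = srv.accept()
          with conn:
              conn.sendall(b"reply from peer %d: " % self.port + conn.recv(1024))

      def ask(self, other_port, message):
          # Client role: take the initiative and query another peer.
          with socket.create_connection(("127.0.0.1", other_port)) as s:
              s.sendall(message)
              return s.recv(1024)

  a, b = Peer(9001), Peer(9002)
  threading.Thread(target=a.serve, daemon=True).start()
  time.sleep(0.2)                        # give the listener time to start
  print(b.ask(9001, b"hello"))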

This last case allows me to depict a realistic scenario, in which JADE lets a set of agents resident in (multiple) mobile communication devices transform the mobile device into the user’s alter ego. In Fig. 4 the agent possesses domain knowledge, learns from the user by examples, observes and learns how the user interacts with the device and can even request advice from its peers. 

personal_assistant_metaphore

Fig. 4 – Devices as user’s alter ego

JADE allows the easy creation of communities driven by the deep understanding of their masters that personal agents can build up by observing their actions. Gone would be the days when my alter ego was out of my control and in the hands of some SP. A key technology for practically implementing this paradigm is clearly the ability to learn from user actions.


Technology Challenging Rights

Rights to anything are defined by the public authorities of the country in which the thing is located. Rights to a literary or artistic work are no different and only apply within a particular country. As this did not promote commerce in such works, the Berne Convention for the Protection of Literary and Artistic Works was established in 1886 to create a basic set of rules valid in the countries that signed the Convention, hence beyond national borders. The architecture of the Convention is heavily influenced by the “author’s rights” approach to protection, and indeed the United States only joined the Convention in 1988, a little more than one century after its establishment.

Besides a broad definition of “Literary and artistic works” that applies to every production in the literary, scientific and artistic domain using a variety of expressions (Art. 2.1), the Convention sets a number of important principles, e.g.:

  1. The author has the right to claim authorship of the work and to object to any distortion or mutilation that would be prejudicial to his honour or reputation (Art. 6 bis – 1).
  2. Different media are protected for different periods of time (Art. 7).
  3. Authors have the exclusive right to authorise the reproduction of their works, but reproduction of such works in certain special cases is permitted (Art. 9).
  4. Quotations may be made from a work that has already been lawfully made available to the public (Art. 10-1).
  5. Works may be used by way of illustration in publications, broadcasts or sound or visual recordings for teaching (Art. 10-2).

Other international treaties relevant to rights of literary and artistic works are the Universal Copyright Convention of 1952, the International Convention for the Protection of Performing Artists, Producers of Phonograms and Broadcasting Organisations (Rome Convention) of 1961 and the Convention for the Protection of Producers of Phonograms Against Unauthorised Duplication of their Phonograms of 1971.

Often it makes practical sense to uphold a “right”, either as nationally or internationally defined, only when there are practical means to enforce it. This was not really an issue in the early times because works, first books but later other media as well, were distributed on physical carriers and their duplication required costly equipment that only professionals could afford (but this was not necessarily an impediment to printers who reprinted books printed in England and triggered Queen Anne’s Act). Keeping professional “pirates” in check was a task for which the traditional setting of courts of law and police forces was adequate, and the infringement of the author’s or publisher’s rights could be kept below an acceptable threshold.

Progress in technology, however, gradually lowered the barrier to entry for those wanting to make copies. Thus the law was forced to deal with the lowering thresholds by introducing “exceptions”, so that persons making a copy of a few pages of a book, or of a song, would not automatically be turned into criminals and manufacturers of copiers, recorders and cassettes would not automatically become their accomplices. Even though rights holders feared that the home recorder had become a potential tool for mass piracy, the reality of analogue technology was such that the quality of copies of records and films was worse than the original and degraded further with each copy. Actually the enhanced ease of distribution promoted knowledge of the work and prompted others to buy an original.

Therefore most European countries grant consumers the right to make private copies, the principle being that these are not likely to compete with, and so reduce the market for, the original works. At the same time, however, a levy was applied on recording equipment, including blank tapes. The proceeds of this levy went to rights holders in order to compensate them for the “loss of revenues” caused by such private copying. In the United States the Audio Home Recording Act (AHRA) grants consumers the ability to make private copies of broadcast music.

The US Copyright Law adopted the notion of “fair use”, a compromise between the strict application of the publisher’s rights and a “reasonable” use of the work. This is to be assessed on the basis of four parameters:

  • Character of the Use: e.g. for educational or non-profit purpose
  • Nature of the Work: e.g. the work is factual as opposed to being creative
  • Portion of Work Copied: e.g. a small portion of the work
  • Effect on the Market Value of the Work: e.g. small impact on the value of the work as a consequence of its use.

Probably the first encounter of content with digital technologies under the auspices (so to speak) of the law happened in the US Congress with the AHRA. This was triggered by the appearance on the market of the Digital Audio Tape (DAT), a digital audio recording device introduced by Sony in 1987. This device was capable of recording up to 2,048 kbit/s of data in real time and an obvious, although not unique, use of the DAT was to copy the stream of bits leaving a CD player. This was of concern to the music industry, because of the ability of the DAT to make perfect copies, a major departure from the self-degrading feature of the Compact Cassette. Therefore the AHRA made it mandatory that digital recording devices like the DAT be equipped with a Serial Copy Management System (SCMS), which would allow copies to be made of an original digital recording but not of those copies. An apparently innocent footnote, which the IT industry took care to have clearly spelled out, was that the AHRA only applies to CE devices, not to IT devices.

Fearing the arrival of more digital technologies, with their ability to allow a limitless number of perfect copies of a digital original, the World Intellectual Property Organisation (WIPO) started work designed to extend some provisions of the Berne Convention and provide responses to the challenges brought about by Information and Communication Technologies (ICT). In December 1996 the Diplomatic Conference on Certain Copyright and Neighbouring Rights Questions adopted two Treaties, namely the WIPO Copyright Treaty and the WIPO Performances and Phonograms Treaty. Some of the features of these Treaties were:

  • Computer programs are protected as literary works.
  • Compilations of data or other material constitute intellectual creations.
  • Authors of computer programs, cinematographic and phonographic works have the exclusive right of authorising commercial rental of their works.
  • Authors have the exclusive right of authorising any communication to the public of their works by wire or wireless means.
  • States that are party to the treaty (Signatories) provide legal remedy against those who alter Rights Management Information, i.e. information which identifies the work, the author, the rights owners, information about the terms and conditions of use, and any numbers or codes that represent such information.
  • Signatories have an obligation to make unlawful any device, product or component incorporated into a device or product, the primary purposes or primary effect of which is to circumvent any process, mechanism or system that prevents or inhibits the exploitation of rights holders’ rights.

The treaties were converted into national laws by many Signatories. In the USA, the Digital Millennium Copyright Act (DMCA) was enacted in 1998 and in Europe the European Directive on Copyright and Related Rights in the Information Society was adopted in 2001. The Copyright Directive, however, had to be converted into national laws by the national parliaments of EU Member States.

No matter how well the future is prepared for, something unexpected is always bound to happen. No better example of this humble piece of wisdom can be found than MP3. The story started in November 1992 when the then still largely unknown ISO WG called MPEG approved MPEG-1. The Layer I and Layer II portions of that standard had very clear application targets: Digital Compact Cassette (DCC) and Digital Audio Broadcasting (DAB), respectively. The performance of Layer III was significantly superior to the other two Audio Layers, but the algorithm was deemed by some MPEG experts to be “too complicated” for consumer applications and for some time its use was confined to professional environments.

Towards the mid 1990s, however, a number of concurrent events occurred:

  • The number of PC audio boards started picking up and listening to high-quality music from a CD using a PC became commonplace, at least among computer aficionados.
  • Microsoft launched Windows 95, where for the first time it became easy for anybody to copy a music track from a CD onto a hard disk by “drag and drop” on the Windows 95 GUI.
  • The number of Intel Pentium processors picked up, thus providing a generation leap in the computational capability of a PC.
  • The MPEG-1 Audio Layer III reference software was adapted and optimised for execution on the Pentium.
  • The role of the Internet began to change from a communication tool that affected mostly academia and research to a mass communication tool.

In a matter of months, more and more PC users also became MP3 users, as MPEG-1 Audio Layer III was soon christened. With good encoding software, a music track of average duration could be reduced from the utterly unmanageable – at that time – 40-50 Mbyte per track to a still imposing but reasonable 3-4 Mbyte. There were wide variations in music quality depending on the encoder and the bitrate used, but apparently users did not care so much because it was not just music quality they were interested in, but new music experience.
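
The back-of-the-envelope arithmetic behind those figures, assuming a 4-minute track ripped from a CD (44.1 kHz, 16 bits, stereo) and encoded at 128 kbit/s, goes as follows.

  # Rough check of the 40-50 Mbyte vs 3-4 Mbyte figures for a 4-minute track.
  # Assumptions: CD audio at 44.1 kHz, 16 bit, 2 channels; MP3 at 128 kbit/s.
  seconds = 4 * 60
  cd_bits_per_second = 44_100 * 16 * 2               # 1,411,200 bit/s
  mp3_bits_per_second = 128_000
  cd_mbytes = seconds * cd_bits_per_second / 8 / 1_000_000
  mp3_mbytes = seconds * mp3_bits_per_second / 8 / 1_000_000
  print(round(cd_mbytes, 1), round(mp3_mbytes, 1))   # ~42.3 vs ~3.8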

Users began to rip music tracks from their CDs, compress them, and store them on their PC hard disks. Enthusiasts could make compilations to their own liking, not the record companies’. With the ever-increasing bitrate made available by telephony modems, users soon discovered that it was also easy to send MP3 files to their friends as email attachments or by other means.

Some entrepreneurs immediately saw the opportunities offered by the new technology. The first portable MP3 player, the RIO, manufactured by Diamond Multimedia, was a technology jewel. Made up of the necessary amount of flash memory to store the MP3 files and a DSP for decoding the bitstream, D/A conversion and computer interfacing, the whole thing weighed just a few tens of grams, most of them coming from the device case. Users could upload their MP3 files on their RIO players and listen to their songs the way they liked with a freedom never enjoyed before. This was a little more than 5 years after MPEG-1 was approved while some were saying that MPEG-1 Audio Layer 3 was too complex!

Other Web “enthusiasts” saw lucrative uses of the new technology. By ripping off thousands of music files from CDs, compressing them in MP3 and posting them on their web sites, they could lure internauts to their sites, where they would be encouraged to download music files for free. While the files were slowly downloading, the visitor would be exposed to advertisements posted on the site and the advertisement space could be sold to interested companies. Some other enthusiastic entrepreneurs saw the opportunity to host MP3 files on their servers. Users could prove by some means that they had legitimate possession of a CD (already a considerable progress in respecting the value of music assets) and have the tracks of that CD accessible anytime and anywhere to them.

The most enthusiastic entrepreneurs of all thought that they could build a community of music fans who would contribute a portion of their hard disk space to store a portion of the community’s music. A piece of software implementing a protocol, later to be called Peer-to-Peer (P2P), was developed to be installed on users’ computers so that they could “share” the MP3 files present on their hard disks with other users who had installed the same software. The entrepreneurs would then “just” manage a directory that listed which file was available where. People would access the directory, get the information about the closest peers and download the MP3 files they were interested in directly from those peers. Some P2P protocols did not even require a central directory service to operate.

Even artists, both well established and new, tried their chance at getting in direct touch with their fans by posting their own MP3 songs on their websites. But it is one thing to use the web as an advertisement medium and quite another to use it as a place to post wares for free. Artists could possibly become famous, but they would remain penniless unless someone found a smart idea.

The history of the web so far, and of MP3 in particular, has taught some important lessons, even allowing for some limitations – mostly bandwidth. Artists are keen to be able to access their public directly. Consumers are keen to be able to find the songs they are interested in on the web, download them and listen to them when and how they like. Some intermediaries – the lazy ones, I mean – fear the disintermediation brought by the web, because they see it as leading to the annihilation of their role. Other intermediaries – the smart ones, I mean – know very well that web disintermediation is nothing more than an urban legend, because new forms of intermediation simply overtake the old ones.

It is understandable that people owning the rights to hundreds of thousands of songs felt nervous and that for them MP3 was not the tool that provided new experiences in music and made people happy, but a tool that could be used to rob rights holders of their property. The Recording Industry Association of America (RIAA), the trade association of the USA record industry, reacted strongly to the RIO and filed for an injunction against the sale of the device. They claimed the device did not meet AHRA requirements, because it did not have a Serial Copy Management System (SCMS). Diamond countered with the position that the RIO was not a recording device, but simply an IT device with playback functionality, outside the scope of the AHRA. The judge sided with Diamond, which got its way, and today devices running MP3 decoding software are counted in the billions.

A similar story, but with a different ending, happened with the personal MP3 file repository offered by MP3.com. The RIAA sued the company and eventually succeeded. Today the MP3.com website is on sale.

Napster is the name of the company that developed the first and for some time very successful P2P protocol. Here again the RIAA sued the company complaining that Napster was promoting large-scale copyright infringement and successfully made its case in the courts. As a result the number of users of the “service”, that in its heyday numbered over 80 million, dwindled to near zero leading the company to bankruptcy. Similar battles were waged against other incarnations of P2P companies.

Another case affected the DVD. A Norwegian hacker found the encryption keys on a faulty implementation of a DVD player and the information was posted on the web. In the MPAA vs 2600.com case, an American judge ruled that it was not just illegal to post the code of the DeCSS, as the ripper of Content Scramble System (CSS) was called, but it was also a crime to post hyperlinks to it. Some civil liberties associations said that posting computer code on the web, or printing it on a T-shirt, is a manifestation of the free speech right granted by the US Constitution. In the meantime people downloaded programs from the web for doing DeCSS and MPEG-4 Video + MP3 encoding. Users could then “copy” a movie from a DVD and burn it on a CD-ROM in the form of an MP3/MP4 compressed movie.

While it is clear that people have the right to have recourse to the law to protect their rights, and that judges must rule on the basis of what the law is at the moment the sentence is issued, I find it disheartening that the rights and the wrongs are determined by whether something is considered to be a playback or a storage device, a CE or an IT device, whether a company is based in one country or in another, or whether a piece of computer code is printed on a T-shirt or stored in a computer file. It looks like going back 22 centuries to the China of Emperor Qin Shi Huang-Ti, who ordered all philosophers’ books to be burned. Were those books lost? No, simply disciples of those philosophers had learnt those works by heart and could communicate them to the next generation. So, shall we see hackers memorising the DeCSS code and communicating it by word of mouth across the centuries?

I believe society must recover the sense of the ridiculous. Electronic engineers and computer scientists are not standard bearers of that sense, but the sample distinctions quoted above do not make any sense at all, at least to normal people. It seems that keeping these distinctions alive serves the interests of people who make a living out of confusion. The losers are the entrepreneurs, who are deprived of their opportunities to try new businesses if they want to remain good citizens, and the end users, who are deprived of the benefits brought by the wonders of technology if they want to stay good citizens. Quite a few people, I dare say. And – sorry for forgetting them – rights holders as well.


Opening Content Protection

MP3 has shown that the combination of technology and user needs can create mass phenomena behaving like a hydra. For every head of the MP3 hydra that was cut by a court sentence, two new heads appeared. How to fight the MP3 hydra, then? Some people fought it with the force of law but a better way was to offer a legitimate alternative providing exactly what people with their use of MP3 were silently demanding: any music, any time, anywhere, on any device. Possibly for free, but not necessarily so.

Unfortunately, this was easier said than done, because once an MP3 file is released, everybody can have it for free and the means to distribute files multiply by the day. The alternative could have been then to release MP3 files in encrypted form, so that the rights holder retained control, but then we would be back – strictly from the technology viewpoint – to the digital pay TV model with the added complexity that the web was not the kind of watertight content distribution channel that pay TV can be.

One day toward the end of 1997, Cesare Mossotto, then Director General of CSELT, asked me why the telco industry had successfully proved that it was possible to run a business based on a common security standard – embodied in the Subscriber Identification Module (SIM) card of GSM – while the media industry was still struggling with the problem of protecting its content without apparently being able to reach a conclusion. Shouldn’t it be possible to do something about it?

This question came at a time when I was myself assessing the take-up of MPEG-2 by the industry three years after approval of the standard: the take-off of satellite television, which was not growing as fast as it should have, the telcos’ failure to start Video on Demand services, the limbo of digital terrestrial television, the ongoing (at that time) discussions about DVD, the software environment for the Set Top Box and so on.

During the Christmas 1997 holidays I mulled over the problem and came to a conclusion. As I wrote in a letter sent at the beginning of January 1998 to some friends:

It is my strong belief that the main reason for the unfulfilled promise (of digital media technology – my note) lies in the segmentation of the market created by proprietary Access Control systems. If the market is to take off, conditions have to be created to move away from such segmentation.

By now you should realise that this sentence was just the signal that another initiative was in the making. I gave it the preliminary name “Openit!” and convened the first meeting in Turin, attended by 40 people from 13 countries and 30 companies, where I presented my views of the problem. A definition of the goal was agreed, i.e. “a system where the consumer is able to obtain a receiver and begin to consume and pay for services, without prior knowledge of which services would be consumed, in a simple way such as by operating a remote control device”. Convoluted as the sentence may appear, its meaning was clear to participants: if I am ready to pay, I should be able to consume – but without transforming my house into the warehouse of a Consumer Electronics store, please.

OPIMA_interoperability

Figure 1 – A model of OPIMA interoperability

On that occasion the initiative was also rechristened as Open Platform for Multimedia Access (OPIMA).

The purpose of the initiative was to make consumers happy, by removing hassle from their life, because consumer happiness made possible by such a platform would maximise the willingness of end users to consume content to the advantage of Service Providers (SP), that in turn would maximise content provisioning to the advantage of Content Providers (CP). The result would be a globally enhanced advantage of the different actors on the delivery chain. The meeting also agreed to a work plan that foresaw the completion of specifications in 18 months, achieved through an intense schedule of meetings every other month.

The second OPIMA meeting was held in Paris, where the OPIMA CfP was issued. The submissions were received and studied at the 3rd meeting in Santa Clara, CA, where it also was decided to make OPIMA an initiative under the Industry Technical Agreement (ITA) of the IEC. This formula was adopted because it provided a framework in which representatives of different companies could work to develop a technical specification without the need to set up a new organisation as I had done with DAVIC and FIPA before.

From this followed a sequence of meetings that produced the OPIMA 1.0 specification in October 1999. A further meeting was held in June 2000 to review comments from implementors; it produced version 1.1, which accommodated a number of the comments received.

The reader who believes it is futile to standardise protection systems, or Intellectual Property Management and Protection Systems (IPMP-S), to use the MPEG-4 terminology, should wait until I say more about the technical content of the OPIMA specification. The fact of the matter is that OPIMA did not provide a standard protection system because OPIMA was a standard just for exchanging IPMP-Ss between OPIMA peers, using an OPIMA-defined protocol for their secure download. Thus the actual choice of IPMP-Ss was outside of the standard and was completely left to the users (presumably, some rights holders). The IPMP-S need not stay the same and could be changed when external conditions so determined.

An OPIMA peer produces or consumes Protected Content. This is defined as the combination of a content set, an IPMP-S set and a rules set that apply under the given IPMP-S. The OPIMA Virtual Machine (OVM) is the place where Protected Content is acted upon. For this purpose OPIMA exposes an “Application Service API” and an “IPMP Service API”, as indicated in the figure below.

opima2

Figure 2 – An OPIMA peer

The OVM sits on top of the hardware and the native OS but OPIMA does not specify how this is achieved. In general, at any given time an OPIMA peer may have a number of IPMP-Ss either implemented in hardware or software and either installed or simply downloaded. 

To see how OPIMA works, let us assume that a user is interacting with an application that may have been obtained from a web site or received from a broadcast channel (see Fig. 3 below).

opima3

Figure 3 – An example of operation of an OPIMA peer

When the user wants to access some protected content, e.g. by clicking a button, the application requests the functionality of the Application Service API. The OVM sets up a Secure Authenticated Channel (SAC). Through the SAC the IPMP-S corresponding to the selected content is downloaded (in an MPEG-4 application there may be several objects, each possibly protected with its own specific IPMP-S). The content is downloaded or streamed. The OVM extracts the usage rules associated with the content. The IPMP-S analyses the usage rules and compares them with the peer entitlements, as provided e.g. by a smart card. Assuming that everything is positive, the IPMP-S instructs the OVM to decrypt and decode the content. Assuming that the content is watermarked, the OVM will extract the watermark and hand the watermark information over to the IPMP-S for screening. Assuming that the screening result is positive, the IPMP-S will instruct the OVM to render the content.
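
The sequence just described can be restated schematically in code. Every name below is a hypothetical stand-in for an OVM or IPMP-S service, not an actual OPIMA API call, and the stubs exist only to make the control flow explicit.

  # Schematic restatement of the OPIMA flow; all names are hypothetical.
  class StubIpmpSystem:
      def check(self, rules, entitlements):    # compare usage rules to entitlements
          return rules <= entitlements
      def screen(self, watermark):             # screen the extracted watermark
          return watermark == "legitimate"

  class StubSmartCard:
      def entitlements(self):
          return {"play"}

  def open_secure_authenticated_channel():   return "SAC"
  def download_ipmp_system(sac, content_id): return StubIpmpSystem()
  def fetch_content(content_id):             return {"rules": {"play"}, "payload": b"encrypted"}
  def extract_usage_rules(protected):        return protected["rules"]
  def decrypt_and_decode(protected):         return protected["payload"]
  def extract_watermark(clear):              return "legitimate"
  def render(clear):                         return "rendering"

  def access_protected_content(content_id, smart_card):
      sac = open_secure_authenticated_channel()        # the OVM sets up the SAC
      ipmp_s = download_ipmp_system(sac, content_id)   # IPMP-S matching the content
      protected = fetch_content(content_id)            # downloaded or streamed
      rules = extract_usage_rules(protected)           # extracted by the OVM
      if not ipmp_s.check(rules, smart_card.entitlements()):
          return None                                  # entitlements insufficient
      clear = decrypt_and_decode(protected)            # OVM acts on IPMP-S instruction
      if not ipmp_s.screen(extract_watermark(clear)):  # watermark screening
          return None
      return render(clear)

  print(access_protected_content("movie-1", StubSmartCard()))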

It might be worth checking the degree of similarity between the OPIMA solution and GSM. In GSM, each subscriber is given a secret key, a copy of which is stored in the SIM card and in the service provider’s Authentication database. The GSM system goes through a number of steps to ensure secure use of services:

  • Connection to the service network.
  • Equipment Authentication. Check that the terminal is not blacklisted by using the unique identity of the GSM Mobile Terminal.
  • SIM Verification. Prompt the user for a Personal Identification Number (PIN), which is checked locally on the SIM. 
  • SIM Authentication. The service provider generates and sends a random number to the terminal. This and the secret key are used by both the mobile terminal and the service provider to compute, through a commonly agreed ciphering algorithm, a so called Signed Response (SRES), which the mobile terminal sends back to the service provider. Subscriber authentication succeeds if the two computed numbers are the same. 
  • Secure Payload Exchange. The same random number and secret key are used to compute, using a second algorithm, a ciphering key that will be used for payload encryption/decryption, using a third algorithm. 

The complete process is illustrated in Figure 4 and sketched in the code below.

gsm_model

Figure 4 – Secure communication in GSM
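
The challenge-response at the heart of these steps can be sketched as follows, with HMAC-SHA-256 standing in for the operator-specific A3/A8 algorithms; key sizes and truncations are illustrative, not those of real GSM.

  # GSM-style challenge-response sketch. HMAC-SHA-256 replaces the real,
  # operator-specific A3/A8 algorithms; sizes are illustrative.
  import hmac, hashlib, os

  def a3_signed_response(ki: bytes, rand: bytes) -> bytes:
      return hmac.new(ki, b"A3" + rand, hashlib.sha256).digest()[:4]

  def a8_ciphering_key(ki: bytes, rand: bytes) -> bytes:
      return hmac.new(ki, b"A8" + rand, hashlib.sha256).digest()[:8]

  ki = os.urandom(16)                 # secret key shared by SIM and provider
  rand = os.urandom(16)               # random challenge sent by the provider

  sres_sim = a3_signed_response(ki, rand)     # computed inside the SIM
  sres_net = a3_signed_response(ki, rand)     # computed by the service provider
  print("authenticated:", hmac.compare_digest(sres_sim, sres_net))

  kc = a8_ciphering_key(ki, rand)             # ciphering key for the payload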

While the work carried out by FIPA, MPEG and OPIMA was moving on at the pace of one meeting every 50 days on average, I was contacted by the SDMI Foundation. SDMI had been established by the audio recording industry as a not-for-profit organisation at the end of 1998, ostensibly in reaction to the MP3 onslaught or, as the SDMI site recited:

to develop specifications that enable the protection of the playing, storing, and distributing of digital music such that a new market for digital music may emerge.

I was asked to be Executive Director of SDMI at a time when I was already reaching my physical limits with the three ongoing initiatives mentioned above – without counting my job at CSELT, which my employer still expected me to carry out. If I accepted the proposal I would run the risk that a fourth initiative would be one too many, but the enticement was too strong to resist: being involved in a high-profile type of media, already under stress because of digital technologies; helping shepherd music from the clumsy world of today to the bright digital world of tomorrow; doing so in an organisation that was going to be the forerunner of a movement that would set the rules and the technology components of the digital world. These were the thoughts I had after receiving the proposal, which I eventually accepted.

The first SDMI meeting was held in Los Angeles, CA at the end of February 1999. It was a big show with more than 300 participants from all technology and media industries. The meeting agreed to develop, as a first priority, the specification of a secure Portable Device (PD). To me that first meeting signalled the beginning of an exciting period: in just 4 months of intense work, with meetings held at the pace of one every two weeks, a group of people who did not have a group “identity” a few weeks before was capable of producing the first SDMI Portable Device specification 1.0 (June 1999). All meetings but one were held in the USA, and every other meeting was a meeting of the Portable Device Working Group (PD WG), to whose chairman Jack Lacy, then with ATT Research, goes much of the credit for the achievement. I am proud to have been the Executive Director of SDMI at that time and to have presided over the work carried out by a collection of outstanding brains in those exciting months.

I will spend a few words on that specification. The first point to be made is that PD 1.0 is not a complete standard, like any of the MPEG standards or even OPIMA because it does not say what a PD conforming to SDMI PD specification does with the bits it is reading. It is more like a “requirements” document that sets levels of security performance that must be satisfied by an implementation to be declared “SDMI PD 1.0 compliant”. It defines the following elements: 

  • “Content”, in particular SDMI Protected Content
  • “SDMI Compliant Application”
  • “Licensing Compliant Module” (LCM), i.e. an SDMI-Compliant module that interfaces between SDMI-Compliant applications and a Portable Device
  • “Portable Device”, a device that stores SDMI Protected Content received from an LCM residing on a client platform. 

Figure 5 represents a reference model for SDMI PD 1.0.

sdmi_reference_model

Figure 5 – SDMI Portable Device Specification Reference Model

According to this model, content, e.g. a music track from a CD or a compressed music file downloaded from a web site, is extracted by an application and passed to an LCM, from where it can be moved to the “SDMI domain”, housed by an SDMI compliant portable device.  

So far so good, but this was – in a sense – the easy part of the job. The next step was to address the problem that, to build the bright digital future, one cannot do away with the past so easily. One cannot just start distributing all music as SDMI Protected Content, because over the previous 20 years hundreds of millions of CD players and billions of CDs had been sold, and these were all in the clear (not to mention releases on earlier carriers). People did buy those CDs in the expectation that they could “listen” to them on a device. The definition of “listen” is nowhere to be found, because until yesterday it was “obvious” that this meant enjoying the music by playing a CD on a player or after copying it onto a compact cassette, etc. Today it means enjoying the music after compressing the CD tracks with MP3, moving the files to your portable device and so on…

So it was reasonable to conclude that SDMI specifications would apply only to “new” content. But then SDMI needed a technology that would allow a playback device to screen “new” music from “old” music. This was a policy decision taken at the London meeting in May 1999, but its implementation required a technology. The traditional MPEG tool of drafting and publishing a Call for Proposals was adopted by SDMI as well, a good tool indeed because asking people “outside” to respond requires – as a minimum – that one has a clear and shared idea of what is being asked.

The selection of what would later be called “Phase I Screening Technology” was achieved in October 1999. This was a “robust” watermark inserted in the music itself indicating that the music file is “new” content. Robust means that the watermark information is so tightly coupled with the content that even the type of sophisticated processing that is performed by compression coding is unable to remove it. Still, the music should not be affected by the presence of the watermark since we do not want to scare away customers. The PD specification amended by the selected Phase I Screening Technology was called PD 1.1. 

One would think that this specification should be good news for those involved, because everybody  – from garage bands, to church choirs, to major record labels – could now be in the business of publishing and distributing digital music while retaining control of it. 

This is the sequence of steps of how it should work (a short sketch restating the chain follows the list):

  • The music is played and recorded.
  • The digital music file is created.
  • Screening technology is added.
  • The screened digital music file is compressed.
  • The compressed music file is encrypted.
  • The encrypted compressed music file is distributed.
  • A transaction is performed by a consumer to acquire some rights to the content.
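
The sketch announced above restates the chain as a composition of stages. None of the function names come from the SDMI specification and the “processing” is reduced to trivial stubs; the point is only the order of the stages.

  # Hypothetical restatement of the distribution chain; all names are made up.
  def digitise(recording):          return recording.encode()
  def embed_watermark(data, mark):  return mark + data
  def compress(data):               return data                  # placeholder
  def encrypt(data, key):           return bytes(b ^ key for b in data)
  def distribute(item):             return {"item": item, "rights": set()}

  def publish(master_recording, key):
      return distribute(encrypt(compress(embed_watermark(
          digitise(master_recording), b"SCREEN")), key))

  def purchase(distributed_item, rights):
      distributed_item["rights"] |= rights   # the transaction grants some rights
      return distributed_item

  release = publish("live take of the song", key=0x5A)
  print(purchase(release, {"play"}))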

But imagine I am the CEO of a big record label and there are no standards for identification, audio codec, usage rules language, Digital Rights Management (DRM), etc. Secure digital music is then a good opportunity to create a “walled garden” where content will only play on certain devices. With an additional bonus, I mean, that there is a high barrier to entry for any newcomer, actually much higher than it used to be, because to be in business one must make a number of technology licensing agreements, find manufacturers of devices, etc. A game that only big guys can expect to be able to play.

In June 1999, while everybody was celebrating the successful achievement of SDMI PD 1.0, I thought that SDMI should provide an alternative with at least the same degree of friendliness as MP3, and that SDMI protected content should be as interoperable as MP3, lest consumers refuse to part with their money to get less tomorrow than they can have for free today. Sony developed an SDMI player – a technology jewel – and discovered at its own expense that people will not buy a device without the assurance that there will be plenty of music in that format and that their playback device will play music from any source.

On this point I had a fight with most people in SDMI, from record, CE and IT companies – at least those who elected to make their opinions known at that meeting. I said at that time, and I have not changed my opinion today, that SDMI should have moved forward and made other technology decisions. But my words fell on deaf ears.

Still, I decided to stay on, in the hope that one day people would discover the futility of developing specifications that were not based on readily available interoperable technologies, and that my original motivation of moving music from the troglodytic analogue-to-digital transitional age to the shining digital age would be fulfilled. 

Let me return to the working of the robust watermark selected for Phase I. If the playback device finds no watermark, the file is played unrestricted, because it is “old” content. What if the presence of the watermark is detected, because the content is “new”? This requires another screening technology, which SDMI called Phase II, capable of answering the question of whether the piece of “new” content presented to an SDMI player is legitimate or illegitimate – in practice, whether the file had been compressed with or without the consent of the rights holders.

A possible technology solving the problem could have been a so-called “fragile” watermark, i.e. one with features that are just the opposite of those of the phase I screening watermark. Assume that a user buys a CD with “new” content on it. If the user wishes to compress it in MP3 for his old MP3 player, there is no problem, because that player knows of no watermark. But assume that he likes to compress it so that it can be played on his feature-laden SDMI player that he likes so much. In that case the MP3 compression will remove the fragile watermark, with the consequence that the SDMI player will detect its absence and will not play the file. 
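
The decision logic of this two-phase screening can be written down compactly. The sketch below is my own restatement of the policy described in the text, not code from any SDMI document.

  # Restatement of the two-phase screening policy; purely illustrative.
  def sdmi_player_decision(robust_watermark_found: bool,
                           fragile_watermark_found: bool) -> str:
      if not robust_watermark_found:
          # No Phase I watermark: legacy ("old") content, play unrestricted.
          return "play"
      if fragile_watermark_found:
          # "New" content whose fragile watermark is intact: admit it.
          return "play"
      # "New" content whose fragile watermark has been destroyed, e.g. by an
      # unauthorised MP3 compression: refuse to play.
      return "reject"

  print(sdmi_player_decision(robust_watermark_found=True,
                             fragile_watermark_found=False))   # reject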

That phase of work, implying another technologically exciting venture, started with a Call in February 2000. Submissions were received at the June meeting in Montréal, QC and a methodical study started. At the September meeting in Brussels, with only 5 proposals remaining on the table, SDMI decided to issue a “challenge” to hackers. The idea was to ask them to try and remove the screening technology, without seriously affecting the sound quality (of course you can always succeed removing the technology if you set all the samples to zero). 

The successful hackers of a proposal would jointly receive a reward of 10,000 USD. The word challenge is somewhat misleading for a layman, because it can be taken to mean that SDMI “challenged” the hackers. Actually that word is widely used in the security field: when an algorithm is proposed, it is submitted to “challenges” by peers to see if it withstands attacks. Two of the five proposals (not the technologies selected by SDMI) were actually broken and the promised reward given to the winners.

Unfortunately it turned out that none of the technologies submitted could satisfy the requirements set out at the beginning, i.e. being unnoticeable by so-called “golden ears”. So SDMI decided to suspend work in this area and wait for progress in technology. This, however, happened after I had left SDMI because of my new appointment as Vice President of the Multimedia Division at the beginning of 2001, after CSELT had been renamed Telecom Italia Lab and given a new, more market oriented, mission. 

I consider it a privilege to have been in SDMI. I had the opportunity to meet new people – and smart ones, I must say – and they were not all engineers or computer scientists.

Let me recount a related story. After I joined SDMI my standing with the content community increased substantially, in particular with the Società Italiana Autori ed Editori (SIAE), the Italian authors’ and publishers’ society. Eugenio Canigiani, a witty SIAE technologist, and I planned an initiative designed to offer SDMI content to the public, and we gave it the name dimension@.

dimension_concept

Figure 6 – The dimension@ service concept

Dimension@ was planned to be a service for secure content distribution supporting all players of the music value chain. We hoped Telecom Italia and SIAE would be the backers, but it should come as no surprise that a middle manager in one of the two companies blocked the initiative.


Technology, Society and Law

When radio and television broadcasting were introduced, it was reasonable to expect that physical forms of distribution would eventually disappear. Why should people pay for something when the same thing is available for free over the air? This down-to-earth consideration, however, turned out to be wrong, because people listening to radio or watching television hear and see what others have decided they should listen to or watch. When home recorders became available, a similar forecast – that people would buy recorders and blank tapes, and would stop buying vinyl records and watching movies in theatres – was an equally easy guess. Why should people buy a record in a shop if, by waiting until the right time, they could get it for free from the broadcast channel and record it? Not only did the forecast not materialise – people have better things to do in their lives than scanning newspapers to see when a given program will be aired – but great new businesses, even overshadowing the old ones, were created.

Now come Digital Media, and some people make the “obvious” forecast that they signal the end of media as we know them. People will be able to get anything for free, copy it as many times as they want and send it to as many people as they want. Why on earth should they buy something, when they can get exactly the same thing for free? It is an easy temptation to shrug off the threat of the doomsayers of the moment and conclude that, as with the other media technologies that seemed to signal the end of the media world we know, this time too some equilibrium point will be reached that will magically leave things unchanged or, maybe, even create some great new business.

Before taking a position on this debate, let me make a disclaimer. When humans are involved, it is hard to say that there is bound to be a single outcome to a given problem, and it may very well be that an equilibrium point can magically be found and that, some years from now, we will look back and conclude that, once more, humans are resilient and adaptability is their best feature. I personally do not believe this will happen, at least not in a clear-cut manner. If it does – and this seems to be the direction in which we are heading – it is likely that end users will be the losers.

Broadcasting turned out not to be such a threat because being forced to consume what others had selected was not such a good proposition. Home recording was also not a threat because many felt it was more convenient to buy what they were interested in than to go through the hassle of actively searching for a broadcast or a cassette from a friend. Digital Technologies, instead, make it possible to create a repository of all content on the network and let computers do the job of searching, copying, organising and playing back the content a user is interested in. There is total availability and no hassle to discourage a user from going the “easy way” of getting things for free.

Many think that Technical Protection Measures (TPM), supplemented where required by legislation, are the way to enable rights holders to enforce the rights that have been granted to them by law for centuries. The problem is that there are other rights that have also been granted by law to end users.

One such right, the right of access to the common body of knowledge, is one of the first examples of attention by Public Authorities (PA) to the well-being of the populace. In antiquity the entire body of knowledge of a community was spread among its members, and each individual or group of individuals could contribute directly to the augmentation of the common knowledge and draw freely from it. Libraries were devised as a means to facilitate access to the common body of knowledge when its size had exceeded the otherwise considerable memorisation capabilities of people of those times.

Scale aside, the situation today is not different from that of Babylon in 500 BC, Alexandria of Egypt in 100 BC or Rome in 200 AD. Public authorities still consider it part of their duties to provide open and free access to content, and use revenues from general taxation to allow free access to books and periodicals in a public library. In many countries they do the same with audio and video, by applying license fees to those who own a radio or television receiver. The same happens in universities, where the most relevant books and periodicals in a given field are housed in a faculty library; the difference here is that this “service” is offered from the revenues a university gets from enrollment fees. This kind of “free” access, ultimately, is not free at all: someone has to pay for it, in a tortuous and not necessarily fair way.

Protected content has the potential to provide a more equitable way to allow access to the body of knowledge that society “owns”, to the extent that society considers such access integral to the rights of citizens, while ensuring rightful remuneration to those who have produced the content and enabled its representation and carriage. How should public authorities implement that access? Should the richness of information today be considered so large that there is no longer a need for public authorities to play a role in content access, even if citizens must pay to access it? Should PAs get a lump sum from rights holders in return for granting them monopoly rights to exploit their works? Should PAs get a percentage of rights holders’ current revenues to pay for a minimum free access to content for all citizens? There is no need to give an answer to these questions, and there need not be a single answer. It should be left to each individual community to decide where to put the balance between the reward to the individual ingenuity that originated new content and the society that provided the environment in which that content could be produced.

Protected content also has an impact on another aspect. Today the Web houses an amount of information, accessible to every internet user, that has never been seen before in human history. Because the information is in clear text and represented in a standard way, it is possible to create “search engines” that visit web sites, read the information, process it and extract that value added called an “index”. Today the indices are the property of those who created them, and they can decide to offer access to anybody under the conditions they see fit.
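As a toy illustration of what that value added amounts to, the sketch below builds the simplest possible inverted index from pages available in clear text. The sample pages and their wording are invented, and a real search engine of course adds crawling, ranking and much more; the point is only that clear-text content is all a third party needs to build an index of its own.

from collections import defaultdict

# Invented sample pages standing in for content fetched from the open web.
pages = {
    "page1.html": "protected content and the rights of end users",
    "page2.html": "watermark technology for protected audio content",
}

# Inverted index: each word maps to the set of pages that contain it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Whoever owns the index decides how, and to whom, queries are answered.
print(sorted(index["content"]))    # ['page1.html', 'page2.html']
print(sorted(index["watermark"]))  # ['page2.html']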

As long as the original information is freely available on the web, it should be OK for anybody to create whatever value added he can think of from it. But if content has value and an identified owner, it is legitimate to ask whether this value-added processing can be left to anybody or whether the content owner should have its rights extended to this additional information. Moreover, if content is protected, it will be the rights holders themselves who develop the indices, and there will be, in general, no way for a third party to do that. This has serious consequences for the openness of information, because the categories used by rights holders to classify pieces of content are not necessarily those a third party would use, and a third party might have completely different views of what makes a piece of content important. So far rights holders have owned – obviously – the copyright to a given piece of content, but with control of the indices they can even dictate how the content should be considered, prioritised and used.

The nature of media so far has been such that the source was known but the target was anonymous. In most countries one could purchase a newspaper at a newsstand, but the publisher would not know who the consumer was. One could receive a radio broadcast, but broadcasters would not know, except statistically and with varying degrees of accuracy, who their listeners were. Transactions, too, were anonymous. With digital networked media, transactions, even fine-grained patterns such as those related to individual web pages, become observable, and the web has given web site owners the opportunity to monitor the behaviour of visitors to their sites. Every single click on a page can be recorded, and this enables a web site owner to create huge databases of profiles. As most digital content will be traded by electronic means, the technical possibility exists for an organisation to build ever more accurate data about its customers, or for a government to identify “anomalous” behavioural patterns among its citizens. The use that individuals, companies and governments will make of such information is a high-profile problem.

The next issue with protected content is how users can preserve the same level of content accessibility they have today. PAs have often used standards to accomplish some goal. Television standards – and a lot of them exist today, as I have reported – were often introduced at the instigation of PAs for the purpose of either protecting the content industry of one country or flooding other countries with content generated within it. Traditionally, PAs have supported the approach of telecommunication standards with worldwide applicability, because providing communication between individuals, even across political boundaries, was considered part of their duty. The CE industry has always been keen to achieve standards for its products, because the added value of interoperability would increase the propensity of consumers to buy a particular product. The information technology industry has shunned standards in many cases: from the beginning, IT products were conceived as stand-alone pieces of equipment, with basic hardware to perform computations, an OS assembling basic functionality but tied to that hardware, and applications, again designed to run on the specific OS.

What is going to be the approach to standards when we deal with protected content? Indeed, if content is protected, standards may no longer be relevant. With its IPMP Extension, MPEG has already provided solutions that preserve the most valuable goal for end users: interoperability at the level of protected content. So it is possible to request that rights holders guarantee that people retain a practically enforceable right to access the content they are interested in, provided they are ready to accept the rights holders’ access conditions.

Lastly, a necessary condition for a practically enforceable right to free speech in the Information Society is that individuals have general access to content protection technology. Indeed, if TPMs are to be used to package content so that it can be delivered in digital form, access to content protection technologies should not be discriminatory. This is a high societal goal that amounts to giving citizens the practical means to exercise their right to free speech. In the digital world, freedom of speech means being able to express oneself and to make the expression available for other citizens to access, under the conditions that the originator sets, while letting the originator retain control of his expression. The achievement of this goal, however, must be balanced against the critical nature of protection technologies. A rogue user may be given access to a technology because of the need not to discriminate against him, and as a result huge amounts of content may be compromised. In different times the management of these technologies would have been entrusted to the state. In the 21st century there should be better ways to manage the problem.