Last update: 2011/08/21
Reporting about a complex project to integrate most of the different technologies we have talked about so far into a usable standard.
In 1991, while MPEG-1 was maturing and the definition of MPEG-2 was rapidly progressing, I had begun to wonder whether there was scope for work beyond what had been started in 1988, namely coding of audio and video for "high" bitrate applications, i.e. above 1 Mbit/s. I triggered some discussions at the Paris MPEG meeting in May 1991 and the obvious conclusion was that the lower end of the bitrate spectrum was a likely candidate for such work.
That was far from a "new" area for audio and video coding. The ITU-T had been producing a number of speech coding standards aimed at reducing the canonical PCM rate of 64 kbit/s obtained from 8 kHz sampling and 8 bits per sample. Other bodies, like ETSI with GSM, were defining new speech codecs for mobile applications, while work had also been done on a so-called "wideband speech codec", i.e. a codec for speech sampled at a rate of 16 kHz and > 8 bits/sample.
This area of low bitrate video coding had also attracted my attention before. Although in 1987 I was not as sceptical as today of everything related to person-to-person video communication, I felt the need to promote ISDN visual telephony simply because the field was moving at a snail's pace. The Picture Coding Symposium, the recognised academic forum for video coding, had papers on the topic but I thought that by starting the International Workshop on 64 kbit/s coding of moving video promoting more focused R&D, I could speed adoption of visual telephony on ISDN. The very H.261 project, started from transmission rates of nx384 kbit/s (384 being the minimum common denominator between European and American rates of 2048 and 1536, to accommodate the old transmission multiplexing split), had soon been changed to a project for coding at px64 kbit/s - where p was allowed to assume any value from 1 to 30 - when it became clear that 384 kbit/s was too high a transmission speed to be of practical interest, while 128 kbit/s of ISDN lines were sitting idle waiting for applications. The ITU-T had even started a new project, called H.263, to develop a video codec to improve the performance of H.261 for the lower bitrates. That was partly because of the new results brought by the insuppressible activity of Gisle Bjøntegaard then of Norwegian Telecom, and because of two announcements of consumer-grade videophones for analogue telephone lines based on proprietary solutions.
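The rate figures quoted so far follow from simple arithmetic, which the short sketch below reproduces. It is only an illustration: the function names are invented for this example and are not taken from any standard.

```python
from typing import List


def pcm_rate_kbits(sampling_rate_hz: int, bits_per_sample: int) -> float:
    """Uncompressed single-channel PCM bitrate in kbit/s."""
    return sampling_rate_hz * bits_per_sample / 1000


def px64_rates_kbits(p_min: int = 1, p_max: int = 30) -> List[int]:
    """Target bitrates in kbit/s for p x 64 kbit/s, with p = p_min..p_max."""
    return [p * 64 for p in range(p_min, p_max + 1)]


# Canonical telephone PCM: 8 kHz sampling, 8 bits per sample.
print(pcm_rate_kbits(8_000, 8))  # 64.0

rates = px64_rates_kbits()
# p = 1 is a single 64 kbit/s channel, p = 2 matches the 128 kbit/s
# of an ISDN line, and p = 30 is the top of the range.
print(rates[0], rates[1], rates[-1])  # 64 128 1920
```

The original starting point of 384 kbit/s corresponds to p = 6 in the same scheme.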
All these, however, were initiatives dealing specifically with real-time person-to-person telecommunication applications, the bread and butter of ITU-T, while at that time MPEG had already fully embraced the "generic" approach to media coding standards, aiming at defining the basic coding technology that application domains, possibly including telecommunication, would then customise for their own specific needs. For a few meetings, MPEG kept on discussing the topic and 18 months later had gone a long way in the identification of what it could mean for MPEG to develop a standard in this area. At the SC 29 meeting in November 1992 in Ottawa I presented, on behalf of SC 2/WG 11, the proposal for a new project with the title "Very low bitrate audio-visual coding" that was unanimously approved. It took longer than usual for JTC1 to approve the project, but at the July 1993 New York meeting news came that the project had been finally approved.
Cliff Reader, who had joined Samsung after leaving Cypress, was appointed as chair of a new AHG with the task of identifying applications and requirements of the new project, which had already been christened MPEG-4 before the early dismissal of the MPEG-3 project in July 1992. At the following meeting in September 1993 in Brussels, the ad hoc group was turned into a standing MPEG subgroup with the name of Applications and Operational Environments (AOE). In its 30 months of existence the subgroup was to generate some of the most innovative ideas that characterise the MPEG-4 standard.
The original target of the MPEG-4 project was of course another jump in audio and video compression coding. At the beginning there were hopes that model-based coding for video would provide significant improvements, but it was soon realised that what was being obtained by the ITU-T group working on H.263 was probably going to be close to the best performance obtainable by the type of algorithms known as "block-based hybrid DCT/MC" that until then MPEG and ITU-T video coding standards had been based on. For audio, the intention was to be able to cover all types of audio sources, not just music but speech as well, for a very wide range of bitrates. The development of MPEG-2 AAC, barely begun at that time, prompted the realisation that AAC could become the "bridge" between the "old" MPEG-1/2 world and the "new" MPEG-4 world. Most of the AAC development and the entire MPEG-4 version 1 development were led by Peter Schreiner of Scientific Atlanta, who was appointed as Audio Chair at the Lausanne meeting in March 1995.
At the Grimstad meeting in July 1994, the scope of the project was reassessed and the conclusion was reached that MPEG-4 should provide support to an extended range of features beyond compression. These were grouped in three categories: 1. content-based interactivity, i.e. the ability to interact with units of content inside the content itself, 2. compression and 3. universal access, including robustness to errors, and scalability.
While the understanding of requirements progressed, I came to realise that MPEG-4 should not just be yet another standard that would accommodate more requirements for more applications than in the past. MPEG-4 should allow more flexibility than had been possible before to configure the compression algorithms. In other words the goal of the MPEG-4 standard should extend beyond the definition of complete algorithms to cover coding tools. The alternatives were called "Flex0", i.e. the traditional monolithic or profile-based standard and "Flex1", i.e. a standard that could be configured as an assembly of standardised tools. Unfortunately an uninvited guest called "Flex2" joined the party. This represented a standard where algorithms could be defined simply by using an appropriate programming language.
This was the first real clash between the growing IT and the traditional Signal Processing (SP) technical constituencies within MPEG. The former clearly liked the idea of defining algorithms using a programming language. The decoder could then become a simple programmable machine where the algorithm used to code the specific piece of content would be downloaded, possibly with the content itself. If practically implementable, Flex2 would have been the ultimate solution to audio and video coding. Unfortunately, this was yet another of the recurring dreams that would never work in practice. "Never", of course, being defined as "for the foreseeable future".
Even if the programming language had been standardised, there would have been no guarantee that the specific implementation using the CPU of the decoder at hand would have been able to execute the instructions required by the algorithm in real time. Flex1 was the reasonable compromise whereby the processing-intensive parts would be standardised in the form of coding tools, and could therefore be natively implemented ensuring that standardised tools would be executed in real time. On the other hand, the "control instructions" linking the computationally-intensive parts could withstand the inefficiency of a generic programming language.
It was not to be so. The computer scientists in MPEG pointed out that a malicious programmer, possibly driven by an even more malicious entrepreneur seeking to break competitors' decoders, could always make the control part complex enough, e.g. by describing one of the standard tools in the generic programming language, so that any other Flex1 decoder could be made to break. So it was eventually decided that MPEG-4, too, would be another of the traditional profile-based, monolithic coding standards.
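The Flex1 idea described above can be pictured with a toy sketch: a fixed toolbox of "standardised" tools, which a real decoder would implement natively, and a lightweight control part that merely chains them in the order requested. Everything here is hypothetical. The tool names, the sample values and the `run_configuration` function are invented for illustration and do not correspond to anything in the MPEG-4 specification.

```python
from typing import Callable, Dict, List

Block = List[int]


# Hypothetical "standardised tools". In a real Flex1 decoder these would be
# the processing-intensive parts, implemented natively to run in real time.
def dequantise(block: Block, step: int = 2) -> Block:
    return [v * step for v in block]


def add_offset(block: Block, offset: int = 128) -> Block:
    return [v + offset for v in block]


def clip(block: Block) -> Block:
    return [max(0, min(255, v)) for v in block]


TOOLBOX: Dict[str, Callable[[Block], Block]] = {
    "dequantise": dequantise,
    "add_offset": add_offset,
    "clip": clip,
}


def run_configuration(tool_names: List[str], block: Block) -> Block:
    """Control part: apply the named tools in the order requested."""
    for name in tool_names:
        block = TOOLBOX[name](block)
    return block


decoded = run_configuration(["dequantise", "add_offset", "clip"], [-70, 0, 10, 90])
print(decoded)  # [0, 128, 148, 255]
```

The objection raised by the computer scientists is visible even in this toy: if the control part were a general programming language rather than a list of tool names, nothing would stop a stream from reimplementing `dequantise` in that slower language and overloading the decoder.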
In the meantime, work on refining the requirements was continuing. The first requirement, i.e. content-based interactivity, is now, after years of web-based interactivity, easy to explain. If people like the idea of interacting with text and graphics in a web page, why should they not like to do the same with the individual elements of an audio-visual scene? In order to enable independent access to each element of the scene, e.g. by clicking on them, it would be necessary to have the different audio and visual objects in the scene represented as independent objects.
Further, at the Tokyo meeting in July 1995 it was realised that this object composition functionality would enable not just the composition of natural but also of synthetic objects in a scene. Therefore MPEG started the so-called Synthetic and Natural Hybrid Coding (SNHC) activity that would eventually produce, among others, the face and body animation, and 3D Mesh Compression (3DMC) parts of the MPEG-4 standard. At the same meeting, the first MPEG-4 CfP was issued. The call sought technologies supporting eight MPEG-4 functionalities. Responses were received by September/October and evaluated partly by subjective tests and partly by expert panels. The video subjective tests were performed in November 1995 at Hughes Aircraft Co., in Los Angeles, while the audio subjective tests were performed in December 1995 at CCETT, Mitsubishi, NTT and Sony. At that meeting Laura Contin of CSELT replaced Hidaka-san as the Test Chair.
At the Munich meeting in January 1996 the first pieces of the puzzle began to come into place. The MPEG-4 Video Verification Model (VM) was created by taking H.263 as a basis and adding other MPEG-4 specific elements such as the Video Object Plane (VOP), i.e. a plane containing a specific Video Object (VO), possibly of arbitrary shape. At the same meeting the title of the standard was changed to "Coding of audio-visual objects". Later, other CfPs were issued when new functionalities of the standard required new technologies. This happened for synthetic and hybrid coding tools in July 1996, a new general call for video and audio in November 1996, identification and protection of content in April 1997, intermedia format in October 1997 and others.
The MPEG-4 Audio work progressed steadily after the tests. Different classes of compression algorithms were considered. For speech, these were Harmonic Vector eXcitation Coding (HVXC) for a recommended operating bitrate of 2 - 4 kbit/s, and Code Excited Linear Predictive (CELP) coding for an operating bitrate of 4 - 24 kbit/s. For general audio coding at bitrates above 6 kbit/s, transform coding techniques, namely TwinVQ and AAC, were developed.
At the Florence meeting in March 1996 the problem of the MPEG-4 Systems layer came to the fore. In MPEG-1 (and MPEG-2 PS) the Systems layer is truly agnostic of the underlying transport. In the MPEG-2 Transport Stream, the Systems layer includes the transport layer itself. What should be the MPEG-4 Systems layer? Carsten Herpel of Thomson Multimedia, one of the early MPEG members who had represented his company in the COMIS project, was given the task to work on this aspect, making sure that the old Systems experience of previous standards would be carried over to MPEG-4. The problem to be solved was how to describe all the streams, including information such as media, their coding, the bitrate used, etc. as well as the relations between streams, the means to achieve a synchronised presentation of all the streams, that included a timing model and a buffering model for the MPEG-4 terminal.
The Florence meeting also marked the formal establishment of the Liaison group. Since the very early days of MPEG, I had paid particular attention to making the outside world aware of our work, but the management of this "external relations" activity had been dealt with in an ad-hoc fashion. In Florence I realised that the number and importance of incoming liaison documents had reached the point where MPEG needed a specific function to deal with them on a regular basis. I then asked Barry Haskell of Bell Labs, a key figure in the development of video coding algorithms and the last remaining person, aside from myself, of the original group of 29 attendees at the first MPEG meeting in Ottawa, to act as chair of the "Liaison group". Barry played this important role until 2000 when he left his company. His role was taken over by Jan Bormans of IMEC and then by Kate Grant until the dissolution of the group in 2008.
At the Munich meeting in January 1996, Cliff Reader had announced that he would be leaving Samsung and his participation in MPEG was discontinued for the second time. The burgeoning AOE group was split into three parts at the Tampere meeting in July 1996. The first was the Requirements group that was reinstated after Sakae Okubo had left MPEG at the Tokyo meeting in July 1995. The second was the Systems group, which had been chaired by Jan van der Meer since the Lausanne meeting in March 1995, and the third was the SNHC group. The Chairmen of the three groups became, respectively, Rob Koenen, then with KPN, Olivier Avaro of France Telecom R&D and Peter Doenges of Evans and Sutherland. With the replacement of Didier Le Gall by Thomas Sikora of HHI already done in 1995, the new MPEG management team was ready for the new phase of work.
To support the new functionality, the Systems group needed a new technology not needed in MPEG-1 and MPEG-2, namely the ability to "compose" different objects in a scene. The author of a scene had to be given a composition technology that would tell the decoder where to position audio and visual objects. This would be the MPEG-4 equivalent of the role of a movie director who instructs the scene setter to put a table here and a chair there and asks an actor to enter a room through a door and pronounce a sentence, and another to stop talking and walk away. This "composition" feature was already present in MHEG but was limited to 2D scenes and rectangular objects.
An MPEG-4 scene could be composed of many objects. The range of object types was quite wide: rectangular video, natural audio, video with shape, synthetic face or body, generic 3D synthetic objects, speech, music, synthetic audio, text, graphics, etc. There could be many ways to compose objects on the screen and in sound space, including spatial composition and temporal composition. The features to be expressed in spatial composition could be: is this a 2D or a 3D world, where does this object go, in front of or behind this other object, with which transparency and with which mask, does it move, does it react to user input, etc. The features to be expressed in sound composition information could include those just mentioned, but could also include others that are specific to sound, such as room effects. The features to be expressed in a temporal composition could include: the time when an object starts playing, measured relative to the scene or to another object's time, etc.
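A minimal sketch may help picture what such composition information has to carry. The data structure below is purely hypothetical: the field names, the depth convention and the helper functions are assumptions made for this example and bear no relation to the actual MPEG-4 scene description syntax. It covers only a few of the features listed above: position, front/behind ordering, transparency, and a start time expressed relative to the scene or to another object.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SceneObject:
    name: str
    kind: str                          # e.g. "video with shape", "text", "speech"
    x: float = 0.0
    y: float = 0.0
    depth: int = 0                     # assumption: larger = further from viewer
    transparency: float = 0.0          # 0.0 opaque, 1.0 fully transparent
    start_offset: float = 0.0          # seconds
    relative_to: Optional[str] = None  # None = relative to the scene start


def back_to_front(objects: List[SceneObject]) -> List[str]:
    """Spatial composition: drawing order, furthest object first."""
    return [o.name for o in sorted(objects, key=lambda o: o.depth, reverse=True)]


def absolute_starts(objects: List[SceneObject]) -> Dict[str, float]:
    """Temporal composition: resolve relative start times to scene time.

    Assumes the references contain no cycles (not checked here).
    """
    by_name = {o.name: o for o in objects}

    def resolve(o: SceneObject) -> float:
        if o.relative_to is None:
            return o.start_offset
        return resolve(by_name[o.relative_to]) + o.start_offset

    return {o.name: resolve(o) for o in objects}


scene = [
    SceneObject("caption", "text", x=10, y=200, depth=1,
                start_offset=1.5, relative_to="presenter"),
    SceneObject("background", "rectangular video", depth=10),
    SceneObject("presenter", "video with shape", x=50, y=20, depth=5,
                transparency=0.1, start_offset=2.0),
]

print(back_to_front(scene))    # ['background', 'presenter', 'caption']
print(absolute_starts(scene))  # {'caption': 3.5, 'background': 0.0, 'presenter': 2.0}
```

Sound-specific features such as room effects, 3D worlds and reactions to user input are left out to keep the sketch short.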
So there was the need to specify composition directives, these directives being collected in the so-called scene description. MPEG could have defined its own scene composition technology, but that would not have been a wise move. It would have required considerable resources and it would have taken quite some time to have the technology mature enough to be promoted to a standard. On the other hand in early 1997, when the issue started becoming hot, the Virtual Reality Modeling Language (VRML), as specified by VRML97, was gaining momentum in the 3D Graphics community and, in the best spirit of building bridges between communities and improving interoperability between application areas, it made sense to extend that technology and add the features required by MPEG-4.
This is how the BInary Format for MPEG-4 Scenes (BIFS) activity, led by Julien Signès, then with France Telecom, and subsequently by Jean-Claude Dufourd of ENST, started working on the problem at the February 1997 meeting in Seville. The VRML specification, that had already been taken over by SC 24 for conversion to an ISO/IEC standard as ISO/IEC 14772, was extended to provide such functionalities as 2D composition, the inclusion of streamed audio and video, natural objects, generalised URL, composition update and, most important, a compression format.
At the time of the Seville meeting, the idea of making an MPEG-4 terminal a programmable device had been abandoned and the decision to use VRML97, a declarative composition technology, was made. Still it was considered useful to have the means to support some programmatic aspects, so that richer applications could become possible or easier to make. The MHEG group had already selected Java as the technology to enable the expression of programmatic content and the DVB project would adopt Java as the technology for its Multimedia Home Platform (MHP) solution some time later. It was therefore quite natural for MPEG to make a similar choice. MPEG-J was the name given to that programmatic extension. MPEG-J defines a set of Java Application Programming Interfaces (API) to access and control the underlying MPEG-4 terminal that plays an MPEG-4 audio-visual session. Using the MPEG-J APIs, the applications have programmatic access to the scene, network, and decoder resources. In particular it becomes possible to send an MPEG-let (an MPEG-J application) to the terminal and drive the BIFS decoder directly.
The availability of a composition technology gave new impetus to the SNHC work. Several key technologies were defined, the most important of which were animation of 2D meshes and Face and Body Animation (FBA). The Vancouver meeting of SNHC (July 1999) was the last chaired by Peter Doenges. At the Melbourne meeting Euee S. Jang, then of Samsung, replaced him and a new piece of work called Animation Framework eXtension (AFX) started. This provides an integrated toolbox for building attractive and powerful synthetic MPEG-4 environments.
MPEG-4 defines a coded representation of audiovisual content, much as MPEG-1 and MPEG-2. However, the precise way this coded content is moved on the network is to some extent independent of the coded content itself. At the time MPEG-2 was defined, virtually no broadband digital infrastructure existed. Therefore it was decided to define both the coded representation of content and the way to multiplex the coded data with signaling information into one serial bitstream. At the time of development of the MPEG-4 standard, it became clear that it would not be practical to define a specific solution for content transport, because the available options were MPEG-2 itself, IP, ATM (a valid option at that time but no longer so today) and H.223, the videoconferencing multiplex of ITU-T.
Therefore MPEG decided that, instead of defining yet another transport multiplex, it would specify just the interface between the coded representation of content and the transport for a number of concurrent data streams. But then it became necessary to define some adaptations of the MPEG-4 streams to the other transport protocols, e.g. MPEG-2 TS, so as to enable a synchronised real-time delivery of data streams.
This included a special type of "transport", i.e. the one provided by storage. Even though one could envisage the support of different types of formats, an interchange format would provide benefits for content capture, preparation and editing, either locally or from a streaming server. A CfP was issued at the November 1997 meeting in Fribourg (CH) and the QuickTime proposal made by Apple with the support of a large number of USA IT companies provided the starting point for the so-called MP4 File Format. David Singer of Apple has been driving the work that led to the development of the MP4 FF and other related formats.
Apart from the file format, MPEG has specified only one other content delivery tool - M4Mux - a simple syntax to interleave data from various streams. Since all relevant delivery stacks support multiplex, usage of this tool is envisaged only in cases where the native multiplex of that delivery mechanism is not flexible enough.
From the very early phases of MPEG-4, MPEG was confronted with the task of providing the equivalent of the MPEG-2 DSM-CC protocol. Here again most of the people who had developed that standard had left the committee, but fortunately not Vahe Balabanian, then with Nortel, who was well known for the piles of contributions he had provided at every meeting where DSM-CC was discussed. Vahe came with the proposal of developing a DSM-CC Multimedia Integration Framework (DMIF) and a new group called "DMIF", where the acronym DSM-CC was replaced by "Delivery", was established and Vahe was appointed as chairman of that group at the Stockholm meeting in July 1997.
The idea behind DMIF is that content creators benefit from being able to author content in a way that makes it transparent, whether the content is read from a local file or streamed over a two-way network or is received from a one-way channel, such as in broadcast. This is also beneficial to the user because the playback software just needs to be updated with a "DMIF plug in" in order to operate with a new source. The solution is provided by a DMIF Application Interface (DAI), an interface that provides homogeneous access to storage or transport functionalities independently of whether a stream is stored in a file or delivered over the network or received from a satellite source. The December 1998 meeting in Rome was the last Vahe chaired. From the Seoul meeting in March 1999 Guido Franceschini of CSELT took over as chairman.
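The principle of a single access interface in front of different delivery mechanisms can be sketched as follows. This is an illustration of the idea only: the class and method names are invented for this example and are not the primitives defined by the DAI, and an in-memory buffer stands in for a real file or receiver.

```python
import io
from abc import ABC, abstractmethod
from typing import BinaryIO, Iterator, List


class DeliverySource(ABC):
    """Hypothetical uniform interface seen by the playback software."""

    @abstractmethod
    def packets(self) -> Iterator[bytes]:
        """Yield the stream's data in delivery order."""


class LocalFileSource(DeliverySource):
    """Reads the stream from a file-like object in fixed-size chunks."""

    def __init__(self, stream: BinaryIO, packet_size: int = 4) -> None:
        self._stream = stream
        self._packet_size = packet_size

    def packets(self) -> Iterator[bytes]:
        while True:
            chunk = self._stream.read(self._packet_size)
            if not chunk:
                break
            yield chunk


class BroadcastSource(DeliverySource):
    """Stands in for a one-way channel: packets arrive already framed."""

    def __init__(self, received: List[bytes]) -> None:
        self._received = received

    def packets(self) -> Iterator[bytes]:
        yield from self._received


def play(source: DeliverySource) -> bytes:
    """The player never needs to know which kind of source it was given."""
    return b"".join(source.packets())


from_file = play(LocalFileSource(io.BytesIO(b"MPEG-4 data")))
from_air = play(BroadcastSource([b"MPEG", b"-4 ", b"data"]))
print(from_file == from_air)  # True
```

Supporting a new kind of source then amounts to adding one more subclass, which is the role the text assigns to a "DMIF plug in".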
Three additional technologies called FlexTime, eXtensible MPEG-4 Textual Format (XMT) and Multiuser Worlds (MUW) were added. The first augments the traditional MPEG-4 timing model to permit synchronization of multiple streams and objects that may originate from multiple sources. The second specifies a representation of an MPEG-4 scene that is textual (as opposed to the binary BIFS representation) and its conversion to BIFS. The third enables multiple MPEG-4 terminals to share an MPEG-4 scene and propagates scene changes to all terminals.
The last major MPEG-4 technology element considered in this chapter is Intellectual Property Management and Protection (IPMP). Giving rights holders the ability to manage and protect their multimedia assets was a necessary condition for acceptance of the MPEG-4 standard. A CfP was issued in Bristol and responses received in July 1997. This part of the standard, developed under the leadership of Niels Rump, then with Fraunhofer Gesellschaft (FhG), led to the definition of "hooks", hence the term IPMP Hooks (IPMP-H), allowing proprietary protection solutions to be plugged in.
Another challenge that the new environment posed was that of being able to respond to evolving needs, a requirement that MPEG-1 and MPEG-2 did not have. After being approved nothing was added to the former and in the first years just some minor enhancements - the 4:2:2 and the Multiview Profiles - were added to MPEG-2. In MPEG-4 it was expected that the number of features to be added would be considerable and therefore the concept of "versions" was introduced. Version 1 of the standard was approved in October 1998 at the Atlantic City, NJ meeting. Version 2 was approved in December 1999 at the Maui, HI meeting. Many other technologies have been added to MPEG-4 such as streaming text and fonts.