Archives: 2015-August-20

MPEG-1 Reference Software

One morning in July 1990, Arian Koster of KPN Research called me with a suggestion: “What if MPEG developed a software implementation of the MPEG-1 standard?”. My immediate reaction was to ask what MPEG would gain from this. He said that various companies had already developed their own software implementations of the Video Simulation Model, because that was a necessary step for anybody wanting to take part in video Core Experiments, and that more software would also be developed for Audio and Systems. If everybody contributed just a small portion of their code, MPEG could have a complete software implementation of the standard, everybody in MPEG would be able to use it, and MPEG would get the benefit of many independent users of the software. Frankly, at that time I did not see why anybody should give away part of their code, but it has always been my policy not to disallow something other people believed in just because I did not understand it. I never had to regret this policy, and certainly not in this case, which sowed the seeds of one of the major innovations in MPEG, as we will see in a moment.

Slowly, the idea gained traction and, at the first Santa Clara, CA meeting in September 1990, the Audio group – made of some of the most contentious people in MPEG, but also of those most open to novelties and most structured in their implementations – had already proposed an ad hoc group on “Software Simulation – Audio”. With the contribution of many, in 1994 MPEG could release part 5 of MPEG-1: “Software simulation” (part 4 had already been assigned to “Conformance Testing”, of which we will say more later). While the first four parts of MPEG-1 are normative, in the sense that if you want to make conforming products or generate bitstreams that are decodable (i.e. understandable and actionable by an independently implemented decoder) they must satisfy the requirements of the relevant parts of the standard, part 5 is a Technical Report (TR), i.e. a document that is produced for the general benefit of users of the standard but has no normative value. It is, in ISO language, “informative”.

This is as much as MPEG could progress in those early times on “software implementation of a standard”. But this was just the beginning of a much bigger thing as we will see in later pages. 

 


Conformance

MPEG-1 is a great standard, but there is a potential problem in its practical adoption. Imagine I am a manufacturer and I choose to be in the business of making MPEG-1 encoders and decoders. I believe I have faithfully implemented all normative clauses in ISO/IEC 11172-1, -2 and -3, and I have checked that my decoder correctly decodes content generated with my encoder. Now, a customer of mine buys my MPEG-1 decoder and starts using it to decode content produced by an encoder manufactured by a competitor. Unexpectedly, he encounters problems. My customer talks to my competitor, who shows him that content generated by the competitor’s encoder is successfully decoded by the competitor’s decoder. Who is right? Who is to blame? My competitor or myself or both?

This problem is not new to ISO and IEC. The “Procedures for the Technical Work” prescribe that a standard must contain clauses that enable users of the standard to assess whether an implementation conforms to it. One could even say that a standard is useless if there are no procedures to assess conformity to it. It would be like issuing a law without having courts to which one can have recourse to assess the “conformity” of a specific action with the law.

The conformance problem used to be less well known to ITU because, when telcos were a regulated business providing a “public service”, they performed the “conformity” tests themselves to make sure that terminals from different manufacturers would interoperate correctly on their networks, without exposing subscribers to the kind of incompatibilities I have just described. They used to put a “seal of approval” on conforming terminals. Telcos used to do this because it was part of their public service licence, but also because keeping subscribers happy and not letting them suffer from incompatibilities was “good for business”, an old wisdom that too many proponents of new business models seem to disregard or forget.

When the telecommunication business became deregulated, independent Accredited Testing Laboratories (ATL) were set up. For a fee, ATLs issued certifications to products that had successfully passed the conformance test. But even ATLs are a byproduct of the traditional “public service” attitude of the telcos.

In the IT and CE domains, to which MPEG-1 also, in a sense, belongs, the attitude has always been more “relaxed”. If you buy a mouse and you discover that it does not work on your computer, what do you do? If you are lucky the shop you bought it from will refund your money; if not, you are stuck with a lemon. The same applies if you buy a component for your stereo. Sure, the consumer is protected, because if he is dissatisfied he can always start legal proceedings… The attitude of the IT and CE industry has always been one of either not claiming anything or, at most, of making “self-certification” of conformity.

That, however, is something that may work well when there is a market with large companies producing mass-market products, possibly not terribly sophisticated, and where the product itself depends on a key technology licensed by a company that puts conformity of the implementation – to be verified by the licensing company – as a condition for licensing. Licensors have an interest in making sure that licensees behave correctly, because they are interested in the good name of the technology and, again, because that leads to satisfied customers and hence more revenues. This was regularly the case with major CE products.

Virtually none of these conditions applied to MPEG-1, and certainly not the last. There are multiple patent holders for the MPEG-1 standard but none, in general, has the authority or interest to become the “godfather” and oversee the correct implementation of the standard. Therefore the approach adopted by MPEG has been to develop a comprehensive set of tools to help people make independent checks of an implementation for conformance. 

Part 4 of MPEG-1 “Conformance” gives guidelines on

  1. how to construct tests to verify bitstream and encoder conformance
  2. test suites of bitstreams that can be used to verify decoder conformance. 

Encoder conformance can be verified by checking that a sufficient number of bitstreams generated by the encoder under test are successfully decoded by the reference software decoder. Decoder conformance can be verified by bitstream test suites or by decoding bitstreams generated with the reference software.
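To make the procedure concrete, here is a minimal sketch of the decoder-conformance idea – my own illustration, not the procedure of ISO/IEC 11172-4. The function names are hypothetical, and the tolerance stands in for the accuracy bounds the standard allows, since an MPEG-1 decoder is not required to be bit-exact (the IDCT is specified with tolerances).

```python
# A minimal sketch of decoder-conformance checking, NOT the procedure of
# ISO/IEC 11172-4. `decode_under_test` and `reference_decode` are
# hypothetical callables returning the decoded pictures as numpy arrays;
# `tol` stands in for the standard's accuracy bounds.

from pathlib import Path
import numpy as np

def decoder_conforms(suite_dir, decode_under_test, reference_decode, tol=1):
    for bitstream in sorted(Path(suite_dir).glob("*.mpg")):
        data = bitstream.read_bytes()
        for pic_test, pic_ref in zip(decode_under_test(data),
                                     reference_decode(data)):
            # compare pictures pixel by pixel, within the tolerance
            if np.abs(pic_test.astype(int) - pic_ref.astype(int)).max() > tol:
                print(f"mismatch in {bitstream.name}")
                return False
    return True
```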

A diligent reader who has reached this point might like to know more technical details about the MPEG-1 standard. Such curiosity could be satisfied by the MPEG-1 resource page of mpeg.chiariglione.org, which acted as _the_ MPEG group’s web site until the group’s dissolution in the fall of 2020 by clear forces urged on by obscure forces.


Inside MPEG-1

MPEG-1’s formal name is ISO/IEC 11172. ISO/IEC refers to the fact that the standard is recognised by both ISO and IEC, because JTC 1, the Technical Committee under which MPEG operates, is a joint ISO and IEC Technical Committee. The number “11172” is a 5-digit serial number that the ISO Central Secretariat assigns to identify a new standardisation project and that follows it throughout its life cycle.

The title of MPEG-1 is rather convoluted: “Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s”. To make things worse, it is also not very true. MPEG-1 can very well be used in environments other than digital storage media, and the reference bitrate of 1.5 Mbit/s (the comma in the “1,5” of the title follows ISO conventions) – the original driver of the work – appears only in the title and is nowhere normatively referenced in the text of the standard. More about this later on this page.

A major departure from other similar standards, such as those produced by the ITU-T in the audio coding domain, is that MPEG-1 does not define an end-to-end “delivery system”: it only defines the receiving end of it. Actually, it does not even do that, because the standard only provides the “information representation” component of the audio-video combination, in other words the format of the information (bitstream) entering the receiver. This feature is common to most MPEG standards.

Philosophically this is quite pleasing, being the implementation of a principle that should drive all communication standards: define how to understand the message, not how to build the message. In other words, a communication standard (I would submit that this should be true for all standards, but this would take me astray) should specify the bare minimum that is needed for interoperability. A standard that over-specifies the domain of its applicability is likely to do more harm than good. On the other hand, what is the purpose of a standard that does not address interoperability? When I hear members of some standards committees state: “from now on we will work on interoperability”, I always wonder what they have been doing until then.

Having MPEG-1 (and all the standards following it) written in a decoder-centric way has been a personal fulfilment. Indeed, back in 1979 I submitted to a COST 211 meeting a version of the H.120 Recommendation re-written from a decoder-centric viewpoint: describe what you do with the serial bits entering the decoder and how they are converted into pixels, as opposed to the encoder-centric viewpoint of the Recommendation, which starts from what you do with the pixels entering the encoder and how they are converted into a bitstream on the output wire. Further, the MPEG-1 standard does not say anything about how the coded information is actually carried by a delivery system. It only specifies some general characteristics of it, the most important of which is that delivery be error-free.

This contrasts with the approach that was followed by other, industry-specific, communication standards like Multiplexed Analogue Components (MAC) developed by the EBU in the 1980s.


Figure 1 – Multiplexed Analogue Components (MAC) reference model

Indeed, the MAC standard is the definition of a complete transmission system where everything is defined, from the characteristics of the 11 GHz channel of the satellite link down to the data multiplex, the video signal characteristics, and the digital audio and character coding. I say “down” because this is broadcasting, and the physical layer is “up”.

To be fair, I am comparing two systems designed with very different purposes in mind. MAC was a system designed by an industry – European broadcasters  – writing a system specification for their members, while MPEG-1 is a specification written for a multiplicity of industries, some of which MPEG did not even anticipate at the time it developed the standard. Even so, I think it is always helpful, whenever possible, to write technical specifications isolating different subsystems and defining interfaces between them.

In this context it is useful to consider the figure below


Figure 2 – Standards and interfaces

where System A and System B are separated by Interface X. In MPEG – and in many other standards – interfaces are what a standard is about, and nothing else. If an implementation declares that it exposes the standard Interface X, then the implemented interface must conform to the referenced standard. The same implementation may claim that it also exposes Interface Y, which must then conform to the referenced standard, or it may be silent about it; in that case Interface Y may very well not exist in the implementation or, if it does, it may be anything the manufacturer has decided.

Defining interfaces in a standard makes the design cleaner, it facilitates reusability of components and creates an ecosystem of competing suppliers from system level down to component level. This last point is the reason why the open interface approach to standards is not favoured by some industries – say telco manufacturers – and the reason why IT products have overtaken part of their original business.

MPEG-1 is then another departure from the traditional view of standards as monoliths. 

Even though the three parts of ISO/IEC 11172 are bound together (“one and trine”, as MPEG-1 was called), users do not need to use all three of them. Indeed, a “part” of a standard is itself a standard, so it is possible to use only the Systems part and attach proprietary audio and video codecs. Not that this is encouraged, but this “componentisation” approach extends the acceptance of the standard and lets more manufacturers compete, without preventing the provision of complete systems. It is clear that, if some customers do not need or want to have the pieces of the standard (providing an interface costs more than having no interface at all), they can order a single system without interfaces from one supplier.

The figure below depicts the components of the MPEG-1 standard. 


Figure 3 – MPEG-1 Reference model

Systems

Serial bits arrive at the MPEG-1 decoder from a delivery medium-specific decoder (e.g. the pick-up head of a CD player) and are interpreted by the Systems decoder, which passes “video bits” to the Video decoder and “audio bits” to the Audio decoder, along with other information.

The Systems decoder processes the non-audio and non-video bits contained in the bitstream, i.e. those carrying timing and synchronisation information, and the result of the processing is handed over to the Video and Audio decoders. It is to be noted that MPEG-1 is capable of handling an arbitrary number of compressed video and audio streams, with the constraint that these must all have the same time base.

Two main pieces of timing information are extracted by the Systems decoder and passed on to the audio and video decoders: the Decoding Time Stamp (DTS), telling a decoder when to decode the video or audio information that has been so time-stamped, and the Presentation Time Stamp (PTS), telling a decoder when to present (i.e. display, in the case of video) the video or audio information that has been so stamped. In this way an MPEG-1 stream is a self-contained piece of multimedia information that can be played back without the need of a lower-layer entity, such as a transport.

MPEG-1 Systems specifies how bitstreams of compressed audio and video data are combined. A packet-based multiplexer serialises the audio and video streams and keeps them synchronised. MPEG-1 Systems assumes that the reference time base is provided by a Systems Time Clock (STC) operating at 90 kHz (1/300 of the 27 MHz sampling frequency used for digital video). STC values are represented with 33-bit accuracy and incremented at the 90 kHz rate. The bitstream carries its own timing information in the Systems Clock Reference (SCR) fields. PTSs, also represented with 33-bit accuracy, give the time at which the author expects the audio or video information to be presented. Note that MPEG-1 does not say anything about the use that will be made of audio or video samples, because MPEG-1 is silent on how decoded samples are actually presented. The processing of an MPEG-1 bitstream requires a buffer. Therefore MPEG-1 Systems utilises a Systems Target Decoder (STD) model and DTSs. The latter are required because MPEG-1 Video makes use of B-pictures, which imply that pictures are reordered at the decoder side.
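As a toy illustration of these numbers – mine, not text from the standard – the sketch below converts a presentation time in seconds into a 33-bit, 90 kHz time stamp and shows the wraparound that a real Systems decoder must tolerate.

```python
# Toy model of MPEG-1 Systems time stamps: 90 kHz ticks carried in a
# 33-bit field. Illustrative only.

STC_HZ = 90_000        # Systems Time Clock frequency (27 MHz / 300)
PTS_MODULO = 1 << 33   # time stamps are carried in 33 bits

def seconds_to_pts(t):
    return round(t * STC_HZ) % PTS_MODULO

def pts_to_seconds(pts):
    return pts / STC_HZ

print(seconds_to_pts(1.0))           # 90000 ticks for one second
# The 33-bit counter wraps around roughly every 26.5 hours:
print(PTS_MODULO / STC_HZ / 3600)    # ~26.5
```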

Video

The input video can be modeled as three 3D arrays of pixels, where the first two dimensions carry the spatial visual information and the third corresponds to time. MPEG-1 Video coding can be defined as a function taking three 3D arrays of pixels as input and producing a bitstream. However, unlike other standards such as H.261, MPEG-1 does not have any pre-specified values for these 3D arrays of pixels. In particular, it says nothing about the size of the picture, which can be any value up to the maximum size of 4,096×4,096 pixels, and it also says nothing about the time spacing between two consecutive 2D arrays of pixels (i.e. the frame frequency), which can assume any value from slightly more than 1/24 s to 1/60 s. The only major – and deliberate – constraint is that the spatial position of pixels in consecutive pictures be the same. In other words, MPEG-1 Video can only handle “progressive”, i.e. not interlaced, pictures.

Keeping the described flexibility in terms of number of pixels per line, lines per picture and pictures per second is the right thing to do when writing a standard that is conceived as an abstract Signal Processing (SP) function that operates on the three 3D arrays of pixels to produce a bitstream at the encoder and performs the opposite function at the decoder. Obviously it is not the right thing to do when a company makes a specific product, because a decoder capable of decoding any picture size at any bitrate must be so overdesigned that its cost can easily price the product out of the market.

This is the reason why MPEG-1 Video specifies a set of parameters, called the Constrained Parameter Set (CPS), given in the table below. It corresponds to a “reasonable” set of choices for the market needs at the time the standard was produced.

Table 1 – MPEG-1 Video Constrained Parameter Set 

Parameter                     Value   Units
Horizontal size                 768   pixels
Vertical size                   576   lines
No. of macroblocks/picture      396
No. of macroblocks/second     9,900
Picture rate                     30   Hz
Interpolated pictures             2
Bitrate                       1,856   kbit/s

 

The maximum number of horizontal pixels – 768 – was a commonly used value for computer displays when MPEG-1 was approved, and the maximum number of scanning lines is the number of active lines in a PAL frame. 396 is the maximum number of macroblocks/picture for a “PAL” SIF (288/16 × 352/16), the corresponding value for an “NTSC” SIF being 330 (240/16 × 352/16), and the maximum number of macroblocks/second is the same for both PAL and NTSC (396×25 and 330×30, respectively). The maximum bitrate is the net bitrate of a 2,048 kbit/s primary multiplex when one time slot has been used for, say, audio (29×64) in addition to TS0 and TS16, which are not available for the payload. This is the only place where a number somehow related to the 1.5 Mbit/s of the title is used.
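As a quick check of this arithmetic (my own verification, not part of the standard), the numbers of Table 1 can be reproduced in a few lines:

```python
# Reproducing the Constrained Parameter Set arithmetic of Table 1.

MB = 16  # macroblock side, in pixels

pal_sif  = (352 // MB) * (288 // MB)   # 22 x 18 = 396 macroblocks/picture
ntsc_sif = (352 // MB) * (240 // MB)   # 22 x 15 = 330 macroblocks/picture

# the macroblock rate is the same for both television families
assert pal_sif * 25 == ntsc_sif * 30 == 9_900

# 2,048 kbit/s multiplex: 32 slots of 64 kbit/s, minus TS0, TS16 and
# one slot used for, say, audio
assert (32 - 3) * 64 == 1_856          # kbit/s, the CPS maximum bitrate
```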

This table highlights the fact that an implementation of an MPEG-1 Video decoder is constrained by the size of the RAM (a multiple of 288×352), the number of memory accesses per second (a multiple of 288×352×25 or 240×352×30), the bitrate at the input of the decoder and the number of pictures that can be interpolated. Therefore the hardware constraints are independent of whether the pictures are from an NTSC or a PAL source.

As mentioned above, MPEG-1 Video is basically an outgrowth of H.261, with some significant technical differences, described below, and a flexibility of video formats that in H.261 were limited to CIF or ¼ CIF. The figure below represents a general motion compensated prediction and interpolation scheme.


Figure 4 – General scheme of motion compensated prediction and interpolation video coding

The first is the introduction of an “intraframe mode” (called I-pictures) that can be used to insert programmed entry points, where the information does not depend on the past as it would in a series of predicted pictures (called P-pictures). This was one major requirement coming from the flagship application of “storage and retrieval on Digital Storage Media” (but the same requirement exists for “tuning in” to a broadcast programme). The second is the addition of frame interpolation (called B-pictures) to frame prediction. MPEG-1 Video has then 3 types of pictures, as depicted in the figure below.


Figure 5 – Picture types in MPEG-1 Video

This feature had been considered in the development of H.261 but discarded because of the coding delay it created for real-time communication. For “storage and retrieval on DSM”, instead, the short additional delay caused by B-pictures is largely compensated for by the considerable improvement in picture quality. It is also clear that the more pictures are interpolated, the more memory is needed; for this reason the CPS restricts this flexibility to up to 2 B-pictures.
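A small sketch – a toy model of mine, assuming the typical pattern of up to 2 B-pictures between anchors – may clarify the reordering that B-pictures imply: a B-picture can only be decoded after both of its anchors, so the order of pictures in the bitstream differs from the display order.

```python
# Toy model: derive the bitstream (decoding) order from the display
# order by moving each anchor (I- or P-picture) ahead of the B-pictures
# that are interpolated from it.

def decoding_order(display_order):
    out, pending_b = [], []
    for pic in display_order:
        if pic.startswith("B"):
            pending_b.append(pic)      # hold until the next anchor arrives
        else:
            out.append(pic)            # anchor first...
            out.extend(pending_b)      # ...then the B-pictures that need it
            pending_b = []
    return out

print(decoding_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
# -> ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```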

The figure below represents the hierarchical nature of MPEG-1 Video data whose elements are Group of Pictures (GOP), enclosed between two I-pictures, Pictures, Slices, Macroblocks (made of 4 Blocks) and Blocks (made of 8×8 pixels). 


Figure 6 – Hierarchy of MPEG-1 Video

Slices are another departure from the Group of Block (GOB) structure of H.261. Three of the other technical changes are the increase in the motion estimation accuracy to ½ pixel, the removal of the loop filter, and different types of quantisation. 

The coding algorithm processes the 3D array of pixels corresponding to luminance. Pixels in one (x,y) plane at time (t) are organised in 8×8 blocks. If the picture is of type I, the DCT linear transformation is applied to all blocks and the resulting DCT coefficients are VLC-coded. If the picture is of type P, an algorithm tries to find the best match between a given macroblock at time (t+1) and one macroblock in the picture at time (t). For each macroblock, the motion vector is differentially encoded with respect to that of the immediately preceding macroblock. Each block at time (t+1) is subtracted from the corresponding block at time (t) displaced by the amount indicated by the motion vector. The DCT linear transformation is then applied to the difference block. Motion vectors and DCT coefficients are VLC-coded, applying various tricks to reduce the number of bits required to code these data. If the picture is of type B, each block is interpolated using the available anchors. Different picture types, variable length coding and, obviously, the different amounts of motion in different parts of a sequence make the overall data rate variable.
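For readers who prefer code to prose, here is a compact floating-point version of the 8×8 transform at the heart of these steps. It is illustrative only: real codecs use fast, fixed-point approximations that must stay within the standard’s IDCT accuracy bounds.

```python
import numpy as np

# Orthonormal 8x8 DCT-II: coefficients = C @ block @ C.T, and the
# inverse transform is block = C.T @ coefficients @ C.

N = 8
n = np.arange(N)
C = np.sqrt(2 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
C[0, :] = np.sqrt(1 / N)               # DC row scaling

def dct2(block):
    return C @ block @ C.T

def idct2(coeff):
    return C.T @ coeff @ C

block = np.random.randint(0, 256, (N, N)).astype(float)
assert np.allclose(idct2(dct2(block)), block)   # lossless without quantisation
```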

If the channel has a fixed rate, a first-in-first-out (FIFO) buffer may be used to adapt the encoder output to the channel. The encoder monitors the status of this buffer to control the number of bits it generates. Changing the quantisation parameters is the most direct way of controlling the bitrate. MPEG-1 Video specifies an abstract model of the buffering system, called the Video Buffering Verifier (VBV), in order to constrain the maximum variability in the number of bits used for a given picture. This ensures that a bitstream can be decoded with a buffer of known size.
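The feedback loop can be caricatured in a few lines. This is a toy model under assumptions of mine – bits produced are taken as inversely proportional to the quantiser step, and the thresholds are arbitrary – not the VBV of the standard.

```python
# Toy model of encoder buffer feedback: a fuller FIFO drives a coarser
# quantiser, which reduces the bits spent on the next picture.

def rate_control(picture_complexities, channel_bits_per_picture=60_000,
                 buffer_size=320_000):
    fullness, q = buffer_size // 2, 8
    for complexity in picture_complexities:
        bits = complexity // q                 # crude model: bits ~ 1/Q
        fullness += bits - channel_bits_per_picture
        if fullness > 0.8 * buffer_size:       # filling up: quantise harder
            q = min(q + 2, 31)
        elif fullness < 0.2 * buffer_size:     # draining: spend more bits
            q = max(q - 2, 1)
        yield bits, q, fullness

for bits, q, fullness in rate_control([500_000, 800_000, 300_000] * 3):
    print(bits, q, fullness)
```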

Audio

The algorithm adopted for MPEG-1 Audio Layer I and II is a typical subband-coding algorithm, as represented in the figure below.


Figure 7 – Components of the MPEG-1 Audio codec

PCM audio samples are fed into a bank of polyphase filters with 32 subbands. The filter bank decomposes the input signal into subsampled spectral components. In the case of a Layer III encoder, a Modified DCT transform is added to increase the frequency resolution, which is 18 times higher than in Layer II. Therefore the filtered or “mapped” samples are called subband samples in Layers I and II, and DCT-transformed subband samples in Layer III. A psychoacoustic model is used to estimate the masking threshold, i.e. the noise level that is just below the perception threshold, and this is used to control the quantisation and coding block. A smart encoder knows how to allocate the available number of bits per block so that the quantisation noise is kept below the masking threshold. The allocation strategy and the psychoacoustic model are not specified by the standard, and therefore the model provides the means to differentiate between encoders from different manufacturers. MPEG-1 Audio only provides a very basic informative description of one psychoacoustic model.
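Since the allocation strategy is left to the encoder, one can only illustrate the idea. The sketch below is a greedy allocator of my own devising, with a crude noise model (about 6 dB less quantisation noise per added bit) and purely illustrative inputs.

```python
# Greedy bit allocation: keep giving one more bit to the subband whose
# quantisation noise exceeds its masking threshold by the widest margin.

def allocate_bits(signal_db, mask_db, total_bits):
    n = len(signal_db)
    bits = [0] * n
    for _ in range(total_bits):
        # noise-to-mask ratio per subband, assuming ~6 dB gain per bit
        nmr = [signal_db[s] - 6 * bits[s] - mask_db[s] for s in range(n)]
        worst = max(range(n), key=nmr.__getitem__)
        if nmr[worst] <= 0:      # all noise already below the mask: done
            break
        bits[worst] += 1
    return bits

# e.g. three subbands: a loud tone, a quiet band, a well-masked band
print(allocate_bits([80, 40, 60], [20, 25, 55], total_bits=16))
```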

The “bitstream formatting” block assembles the actual bitstream from the output data of the other blocks, and adds other information (e.g. error correction) if necessary. The resulting data are then packed in fixed-length packets using a bitstream structure that separates the critical parts needing highly reliable transmission. Four different modes are possible. The first two are 1) single channel and 2) dual channel, in which two independent audio signals are coded within one bitstream. The second two are 3) stereo, in which the left and right signals of a stereo pair are coded within one bitstream, and 4) joint stereo, in which the left and right signals of a stereo pair are coded within one bitstream exploiting stereo irrelevancy and redundancy. Layer III has a number of features that enable better performance compared to the lower two layers: entropy coding to further reduce redundancy, a buffer to smooth out high variations in output bits, and more advanced joint-stereo coding methods.

At the decoder, bitstream data are read from a delivery medium-specific decoder. The bitstream data are unpacked to recover the different pieces of information, and the bitstream unpacking block also detects errors if the encoder applied error-checking. The reconstruction block reconstructs the quantised version of the set of mapped samples. The inverse mapping transforms these mapped samples back into PCM samples.

The figure below expands on the structure of the MP3 encoder.


Figure 8 – More details of the MP3 encoder

I would like to conclude this snapshot on MPEG-1 by highlighting the role that my group at CSELT had in developing the first implementation of a full MPEG-1 decoder. Already in 1984 we had started designing a multiprocessor architecture for video coding based on a bus interconnecting a set of multiprocessor boards. The first use of the board was in 1986, to implement one of the first ISDN videophones. This actually became an industrial product of an Italian company and was put into service. Each board featured four Analog Devices 2900 Digital Signal Processors (DSP) and one Intel 80186 CPU that controlled communication between the DSPs and between the boards, because each board was in charge of only a slice of the picture and data had to be passed between boards because of Motion Compensation requirements.

As part of the CSELT work in the COMIS project, my group extended this architecture and implemented the first MPEG-1 Systems, Video and Audio decoding in real time and demonstrated it at the Haifa meeting in March 1992. It should be acknowledged that the MPEG-1 Audio decoding board had been purchased from Telettra.


Figure 9 – The COMIS demo

This was not just a technology demonstration because other partners in the COMIS project (BBC, CCETT and Olivetti) had developed content that included multimedia navigation support based on the MHEG standard, then still under development. The screen in the figure above shows an example of that content. 

I would like to close this chapter by reaffirming that the MPEG-1 standard was created by the concerted efforts of subgroup chairs during interminable sessions where thousands of technical contributions were debated, of some natural leaders who sprang up and took the lead in resolving thorny issues, and of the hundreds of people who took part in all these discussions. I created the environment for this to happen; they made the standard.


The achievements of MPEG-1

Besides being an excellent set of pieces of technology, MPEG-1 is also a remarkable collection of “first ever” achievements. 

It was the first integrated audio-visual standard. This was a great achievement and set an example for the media industry. For the first time a standard had been produced where the individual pieces were highly optimised, because the best specialists in the field had developed them. Still, the individual pieces fit well together, because during the 4 years it took to develop MPEG-1, countless “joint meetings” between the different groups identified the issues that prevented smooth integration of the three parts of the standard and smoothed out all the differences. This is a practice that continues to this day: at every MPEG meeting possibly tens of break-out groups and joint meetings involving two, and sometimes more than two, subgroups take place. MPEG-1 also set an organisational example for companies. Before MPEG-1, the audio and video groups in all standardisation bodies and most research institutions were usually allocated to different parts of the organisation; today most of them are – as they should always have been – together from an organisational viewpoint.

MPEG-1 was also the first standard that defined the receiver and not the transmitter. If the way information is encoded is left undefined, obviously within the syntactic constraints of the standard, then the standard becomes a level playing field where different manufacturers can compete by providing better and better encoding equipment. This prolongs the life span of the standard, whose obsolescence will be decreed only when the scope for encoder optimisation has been exhausted and a new, more powerful standard can be produced using new research results.

MPEG-1 was also the first standard designed to code the video signal independently of the video format (NTSC/PAL/SECAM). I do not claim that this was a particularly relevant technical achievement; I only mention it because it was a policy decision that other Standards Developing Organisations (SDO) did not have the foresight or – better – the courage to implement, because of the highly political meaning attached to anything related to video formats. Indeed, the digital versions of these video formats share the same sampling rate and can be generated by the same decoding device. The display issue is left out of the standard.

MPEG-1 was also the first standard that was developed jointly by all industries with a current, or even expected, stake in the audio-visual business, overcoming their traditional entrenched interests.

MPEG-1 was also the first media-related standard that was developed entirely using software tools and also produced a reference software implementation of the standard.

Lastly, MPEG-1 was the first standard whose quality performance was assessed at the completion of the work (actually, for MPEG-1 this was done only for the audio part).

But next to the congratulations for the excellent technical work done and the number of records set, it is also important to make a dispassionate analysis of how far the original business goals that companies pursued with MPEG-1 were achieved through the standard.

The driving idea of MPEG-1 – interactive video applications on compact discs – was a very natural move. Jointly launched on the market 5 years before by Philips and Sony, CD Audio was (and still is, through its successors DVD and Blu-ray) a roaring – if nowadays declining – success. The specification of CD-ROM as a computer peripheral (ISO/IEC 9660) that enabled users to have hundreds of Mbytes accessible from their computers was already completed. This was something like a dream at a time when hard disks had a capacity of (very) few tens of Mbytes. Everybody believed that, if only digital video could be brought down to a bitrate that could fit in the 1.41 Mbit/s transfer rate of the CD (or the 1.2 Mbit/s of the CD-ROM, single speed at that time!) while preserving a sufficiently high quality, great opportunities were waiting for CE devices, telecommunication terminals, PC peripherals, etc. The typical stand-alone device was what eventually became Compact Disc Interactive (CD-i), the interactive video device par excellence, manufactured by Philips.

With the addition of audio coding to the MPEG work program, CD-i-like devices could benefit from better audio (the original CD-i specification had quite a primitive form of audio), but exciting audio-only applications could be imagined as well. The first idea was to replace inexpensive audio-only analogue recording devices, such as Compact Cassette (CC) players, with equally inexpensive devices capable of analogue and digital recording that still used the CC mechanics as the recording medium. Eventually this idea became a product called Digital Compact Cassette (DCC), manufactured by Matsushita and Philips. The second idea was to introduce a new, fully digital audio broadcasting system, the target of the Eureka 147 DAB project that was eventually deployed in Europe, Canada and other countries.

The reader who happens not to know any of these three acronyms – CD-i, DCC and DAB – should not feel embarrassed. The first product was discontinued several years ago. The reasons are manifold, but the primary one is that periodically the IT mermaid sings her interactive song and some companies are beguiled by it; but once the mermaid has achieved that, she casts them aside. The second product looked like a great idea: billions of compact cassettes used to be sold, but the quality was not what ears accustomed to the compact disc would want. What if we had a system where people could keep on using the same carrier – the cassette – as the old analogue device, but also record and play back compressed digital music at a quality indistinguishable from the CD’s? No way: consumers did not buy it, and the reasons were not technical. The third, broadcasting of digitally compressed studio-quality sound, also looked like a new lease of life for good old radio. The reality of today, several years after the service was launched, is that other forms of digital radio have taken hold, but DAB is not having a prosperous life.

So much for the ability of industries to guess what consumers want and to provide SDOs with precise directions about what standards they need. If MPEG had designed its MPEG-1 Video and Audio coding for those specific industry needs, the name MPEG – if it still existed – would probably not be linked to the idea of products based on successful standards. If it is, it is because MPEG, while valuing industry inputs, made its best efforts to develop “generic” standards by abstracting its requirements from the specific industry requests of the time.

MPEG-1 is a successful standard. Video Compact Disc (VCD) is a product that plays linear video recorded on a CD in MPEG-1, with a quality comparable to VHS’s. Hundreds of millions of VCD players have been sold, especially in the Far East, and many billions of VCDs have been pressed. MPEG-1 was the first audio-visual format for the Personal Computer: since Windows 95, all versions of Windows have had an embedded MPEG-1 software decoder. Even portable video cameras recording in MPEG-1 were manufactured. MPEG-1 Audio Layer II is used in hundreds of millions of digital television Set Top Box receivers. An entire new industry, the VLSI industry for digital audio and video, was created by MPEG-1.

Lastly I must mention MPEG-1 Audio Layer III, aka MP3. Billions of people use it and it would probably require a big effort to identify all companies manufacturing hardware or software MP3 players. But MP3 is another story. Like Mark Twain, I am not going to tell you the story this time, but, unlike him, I will just keep it for a later page.


The Highs And Lows Of Television

For some 80 years, one-to-many communication seemed like a business blessed by the blindfolded goddess. Newspapers were one of the first and, until recently, a very successful such business. Coveted, pampered, feared, lured, controlled or suppressed by Public Authorities, they are incredibly powerful tools to shape public opinion – if content entices the public to purchase them in great numbers. Television, another one-to-many communication business, came to the fore much later, but is said to have taken over much of the role that used to be played by newspapers (and radio). It looks as if somebody buying a newspaper, listening to a radio program or watching television were to say: I am opening my mind to you, would you like to fill it with your messages?

Adding pictures to newspapers has been a great tool to get people’s attention, but moving pictures are much more powerful. To understand the reasons for television’s success in so many layers of the population, suffice it to look at an infant intent on watching television. What else is needed to prove success beyond the billion television sets in current use worldwide? This large number is all the more remarkable if one considers that terrestrial television broadcasting is a very complex system whose deployment requires huge investments. In addition to what it takes to generate the pictures, it requires the installation of transmitting towers, which must be placed in line of sight because they use comparably high frequencies: VHF of about 100 MHz and UHF of a few hundred MHz. This is a technical nightmare in countries like Italy and Japan because of the mountainous nature of their territories.

The political environment has always dispensed loving care to television because of the obvious cultural, educational, entertainment, informational and political value of the medium. For a long time, many countries had a single television broadcasting agency, or a formally private company closely supervised, if not directly run, by the state. Many countries still require paying a viewing licence fee to receive “public” television programs. It is not that many years since a few of the countries that used to have only state-run television agencies allowed the establishment of “commercial” television companies. In order to be allowed to broadcast, these other companies must obtain a licence from the state, because they use a portion of a limited-availability asset – VHF/UHF bandwidth – but they do not get money from a licence fee. So these companies have to be creative to be profitable, typically by resorting to advertisements in their programs or, more recently, by offering pay TV over the air.

The analogue television industry has been remarkably stable. The only significant innovation since the early years has been the introduction of colour, in the 1950s in the USA and in the 1960s in Europe and Japan. Because TV is a full-fledged, or at least a kind of, “public service”, owners of monochrome TV sets could not be disenfranchised by the introduction of an incompatible television service replacing the old monochrome service. Colour had to be introduced in a compatible fashion, i.e. in such a way that an existing television receiver, capable only of receiving monochrome television, would still be able to receive a colour signal and display it as a monochrome signal. In the delivery domain, one can mention the use of CATV and of satellite broadcasting as examples of innovation, not to mention the many innovations that recent years have brought to consumers.

I said “the only significant innovation”, but I may have gone too far. In the late 1960s, Nippon Housou Kyoukai (NHK), the Japanese public broadcaster, started the development of a new generation of television, called High Definition Television (HDTV). The system had approximately double the vertical and horizontal definition of Standard Definition Television (SDTV) and an aspect ratio (the ratio of the horizontal to the vertical dimension of the screen) of 16:9 instead of the standard 4:3 of Standard Definition TV. Therefore the bandwidth required by HDTV was roughly 5 times that of SDTV. NHK selected a frame frequency of 30.00 Hz (sharp) interlaced and 1125 scanning lines, 1035 of which are active (i.e. information-carrying) lines. No spectrum had yet been allocated for transmission of such a broadband signal, but the system immediately caught the attention of broadcasters around the world.

In the meantime, NHK engineers were working hard to develop a compression system called Multiple sub-Nyquist Sampling Encoding (MUSE), which used digital processing to compress the analogue HDTV signal, intended for analogue transmission, to fit in the satellite bandwidth of an SDTV program. This piece of work was truly admired by the scientific world. Actually, only by a part of that world, because the digital purists – the majority – disliked the idea of doing the processing with digital techniques and the transmission with analogue techniques (as if all analogue delivery systems carrying bits did not do the same). In the early 1980s, the word-of-mouth in the business was that Japan and the USA would team up to conquer the entertainment world with HDTV, the former with their control of the technology and their manufacturing prowess, and the latter with their overwhelming capability to produce content suitable for this renewed television experience.

In the same years, the EBU had been working on an alternative project called Multiplexed Analogue Components (MAC). The project was inspired by the fact that, while the visual experience of NTSC and PAL/SECAM in the studio is similar, the spatial resolution of the latter two is significantly superior (by about 20%). If the resolution loss caused by interlace (the so-called Kell factor) could be compensated by, say, doubling the frame frequency, one would obtain a system that was virtually flicker-free, thereby almost reaching the effective vertical resolution of interlaced HDTV. The advantage of this approach was that the television format used in the display (625 lines @ 25 Hz) would be preserved, but the visual experience would be free from the typical PAL (and NTSC) artefacts, i.e. the mixing of colour and luminance information caused by the insertion of colour information in the holes of the luminance spectrum.

As with HDTV, which required the start of a new satellite broadcasting service whose programs could not be received by existing television receivers, MAC also required a special Set Top Box (STB) capable of decoding the signal. However, unlike HDTV, which required new and very expensive monitors, the output of a MAC STB could feed a conventional television monitor while preserving the purity of the signal, if the set had a connector conforming to the Syndicat des Constructeurs d’Appareils Radio et Télévision (SCART) specification that supported component signals.

For years the ITU considered proposals for an international HDTV standard. The commendable idea was that – this time – the broadcasting industry should do away with national roads to television. There should be a single standard replacing the plethora of national television standards that would finally unify the broadcasting world. The watershed – not exactly in line with expectations – happened at the ITU General Assembly of 1986. On that occasion a coalition of European PTT administrations, supported by a European Commission flexing its muscles for the first time in the CCIR arena, stopped the agreement by proposing a “new version” of HDTV. The main features of the system were: 25 Hz frame frequency, 1250 scanning lines (twice the number of PAL/SECAM’s) and a transmission system called High Definition MAC (HD-MAC) that exploited the MAC multiplexing functionalities, obviously incompatible with MUSE, even though it shared a number of technical principles with it. The major departure from MUSE, and a much-publicised feature impacting the user, was that a D2-MAC receiver (one of the many variants of the MAC family, and the one on which the European broadcasting industry had eventually converged) would be capable of decoding a D2-MAC signal from an HD-MAC signal. This solution would have allowed the continuation of the deployment of D2-MAC, whose plans at that time were already quite advanced, at least in some countries, while the content and manufacturing industries, under the protective wings of the European Commission, would gear up to provide the next step in television without jeopardising the already deployed population of receivers.

The work of developing the entire equipment chain for eventually introducing the service was funded by Project 95 (EU 95) of the Eureka R&D program. About 1 billion USD went into that project, which was technically very successful because it developed a range of products going from HD studio cameras and recording to transmission and receiving equipment. They were deployed and tested successfully in great numbers during the 1992 Winter Olympics, the first to receive full HD-MAC coverage.

Even more intense work was taking place in Japan. The advanced state of development of the technology allowed Japan to start a trial broadcasting service of 8 hours a day in 1989. During the decade that the trial service lasted, about half a million MUSE HDTV decoders were deployed. 

The failure of the ITU to adopt the HDTV Recommendation in 1986 had also dealt a blow to the HDTV plans in the USA. Dubbed Advanced Television, the new project for the American path to HDTV was kicked off in the 1986-87 time frame, when the National Association of Broadcasters (NAB) asked the Federal Communications Commission (FCC) not to re-allocate spectrum already assigned to broadcasters to cellular telephony. The FCC complied and created the Advisory Committee on Advanced Television Services (ACATS). At that time, the prevailing view was that 12 MHz of spectrum, i.e. two television channels, would be needed to deliver HDTV. One 6 MHz channel would be used for NTSC as the “base layer” and another for an HDTV “augmentation” signal. In this way the magic “compatible” extension from TV to HDTV would be achieved. Proposals were requested to show that this was feasible, with the intention of selecting a system for the USA market. In a curious twist of history, the original American HDTV plan had started from the discontinuity advocated by the Japanese, just to end in the evolutionary European approach of progression from TV to HDTV. But this was not going to be the end of the story.

For obvious reasons, I had kept myself informed of what was brewing in the CCIR for HDTV, and the failure of the ITU General Assembly to approve the HDTV Recommendation did not come as a surprise. In my years in CEPT, ETSI and CCITT committees I had plenty of opportunities to see what explosive combinations technology and politics could produce. But politics in CEPT and CCITT was incomparably less sophisticated than in CCIR because, at that time, telcos felt protected in their own local markets.

In reaction to the demonstrated lack of results caused by the inextricable mixing of politics and technology, I had begun to develop my own philosophy, prompted by my experience of too many smart people who had gone nowhere just because they had tried to manage both sides – political and technical – of the equation. So the recipe that I developed was: be aware of and conversant with the political issues, but concentrate on the technical side. If the latter was successful, politics – with the innumerable nooses that its players would create for themselves – would eventually have to bow and accept the results of technology. But, to be successful, the technical environment had to involve individuals from all parts of the world – lest there be a repetition of the IVICO experience – and, possibly more importantly, from all the technical communities working for the different industries.

The 1st HDTV Workshop, held at L’Aquila on 12-13 November 1986, just a few weeks after the ITU General Assembly, saw the participation of the major technical players in the HDTV space. A few months later a Steering Committee (SC) was set up that was, in the words of the “Guidelines for Steering Committee Activity” that would guide the organisation for 14 years,

made of people representing the three global regions and technical communities on an equal footing.  

That was my first experience in trying to put together people from different countries and technical communities to achieve a common goal with an eventual business objective that should remain “on the horizon”, but should never be allowed to get in the way of the different business objectives of the different people representing different industries – not to say companies. That experience taught me many lessons that I would use in the following years, apart from giving me the opportunity to get to know a number of new people whose friendship I have preserved over the years. The only big regret I have is the loss of Yuichi Ninomiya, the inventor of MUSE at NHK and an SC member since the early days, who suffered an untimely death.

The HDTV Workshop continued for 14 years, organising yearly conferences. Eventually, however, the launch of the Advanced TV service in the USA, the trial HDTV service in Japan and the virtual neglect of anything “HD” in Europe after 1992 made the HDTV Workshop redundant. The last event was held in Geneva in 1999, but I had already left the workshop in 1994, at the peak of my MPEG-2 and DAVIC efforts, to which I now have to turn.


The Digital Television Maze

At the very first meeting in Ottawa, MPEG had set for itself an initial work plan based on three targets: sub-standard definition television as provided by SIF (1/4 of SDTV) at up to about 1.5 Mbit/s, standard definition television at up to 5 Mbit/s, and high definition television at a bitrate still to be determined. The first two years of MPEG were devoted to building the organisation, establishing the foundations of the MPEG-1 project and providing the technical elements on which the first standard would be established. But with the successful execution of the 1989 Kurihama tests and the transition from the competitive to the collaborative phase at the Eindhoven and Tampa meetings, I could afford to start thinking of the next step. The Turin ad hoc group meeting of MPEG in May 1990 provided the opportunity to set things in motion. On the evening of the first meeting day, D. Le Gall, T. Hidaka, A. Simon and I had dinner together at the Atlantic Hotel in Turin to discuss the steps to be undertaken for the next phase of work.

Disguised as it might appear under the title “Coding of moving pictures for Digital Storage Media having a throughput of 1.5-5 Mbit/s”, the purpose of the project was very clear to broadcasters: MPEG intended to develop a digital television standard, starting from the source coding part. This was a very important issue, as is clear from some simple arithmetic. Assume that a television channel in the VHF or UHF band occupies a bandwidth of 8 MHz, as is common in Europe (in some PAL/SECAM countries the value is sometimes 7 MHz, while in NTSC countries the bandwidth is 6 MHz), and that a broadcaster decides to offer television programs in digital form. Applying the rule of thumb of 4 bit/s per Hz (i.e. a bandwidth of 1 Hz carries 4 bit/s) one gets a gross bitrate of about 30 Mbit/s. Interesting, but not a big deal, if one considers that digital television per ITU-R Recommendation 656 amounts to 216 Mbit/s: one would then need seven analogue television channels to carry just one digital television program.
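Spelled out, with the rule-of-thumb figures above as the only inputs:

```python
# Back-of-the-envelope capacity check: 4 bit/s per Hz of bandwidth.

channel_bitrate = 8e6 * 4          # 8 MHz channel -> ~32 Mbit/s gross
uncompressed_tv = 216e6            # digital TV per ITU-R Rec. 656

print(uncompressed_tv / channel_bitrate)   # ~6.75: about seven channels
```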

Fortunately, video compression comes to help. I have shown before how simple DPCM methods can reduce the bitrate to about 70 Mbit/s – again useless for broadcasting – but video compression experts are very smart people and have invented all sorts of tricks to bring the bitrate down to about 30 Mbit/s while still using DPCM (of course this does not mean that audio compression experts are not smart; simply, they are more obsessed with quality because the ear is less tolerant of distortion than the eye). But even so, there is not much to gain by replacing an analogue system that works well and is deployed in a billion TV sets with another, incompatible system just for the sake of making it “digital”. If more sophisticated transform-based algorithms are applied, however, a reduction of some 40 times in bitrate becomes possible, and a broadcaster can even squeeze something like 6 TV programs into one UHF channel.

“What for?” is a question I will deal with later. Continuing with the idea of squeezing more TV programs into the same UHF channel, the problem was that, while designing VLSI chips based on DPCM for video signals had looked feasible since early times, designing a chip employing DCT and Motion Compensation, even for ¼ of TV resolution like SIF, was still a challenge in the early 1990s, and the VLSI industry had a hard time trying to make MPEG-1 decoders and, even more so, encoders. The complexity remained in spite of all the care that had been put into designing the MPEG-1 standard, and the ability of VLSI designers to design MPEG-1 chips was stretched to its limits. The efforts made for MPEG-1, however, did pay off for MPEG-2, and MPEG-1 can well be considered the stepping-stone to the next, more ambitious – and, on the video side, more economically rewarding – MPEG-2 standard.

The business of satellite broadcasting was even more receptive to the idea of the multiplication of channels promised by digital technologies. Ten satellite transponders have a high cost and still provide only a limited choice. But what if they became the vehicle for 60 TV programs? Apart from the obvious multiplication of choice for subscribers, one could think of so-far unthinkable new ways of serving them: the same latest hit movie broadcast simultaneously at staggered times, so that subscribers do not have to wait two hours for the movie to start again; or serving different subscriber communities with diversified interests, etc.

A similar incentive existed for CATV operators. Here several tens of different programs had already been the norm for several years, but the use of digital television could enable operators to broadcast hundreds of different programs – 500 was the magic number mentioned by one media tycoon in the first half of the 1990s that caught the imagination of the reporters of that time. If one considers that a cable serves between a few hundred and a few thousand subscribers, it is clear that digital technologies could enable program offerings fine-tuned to match the wishes of the most exacting subscribers.

So far we have talked of the benefits – real or supposed – for incumbents, but there were other business players stomping their feet to enter this business: the telcos. Stuck for more than a century with a much-in-demand and profitable service – telephony – but just that one (indeed, in spite of early attempts to make it “another” communication service, facsimile turned out to be equivalent, for all practical purposes, to a phone call), they had always wanted to enter the business of sound and television distribution. The former had never been considered very attractive, but the latter was much coveted, even though, or maybe because – as I have already said – it was thought to require a major overhaul of their networks.

Attempting such an overhaul was not impossible. In the highly regulated telecommunication landscape of the Federal Republic of Germany of the 1980s, Deutsche Bundespost Telekom (DBT) was the only concern allowed by federal law to operate programme exchange between studios as well as distribution between studios and broadcasting towers. Therefore laying and operating CATV was DBT’s exclusive prerogative, and they exploited it quite successfully. But this is maybe the only big CATV-related success story in the regulated telecommunication world, if one does not count the less successful story of the French Direction Générale des Télécommunications (DGT). Possibly this was caused by the fact – with few similar cases in the world – that the government itself was running the telecommunication business through a branch of the PTT ministry.

Telcos had always wanted a total overhaul of their network, an extension of their business such as CATV being only their second-best option. The technology existed – optical fibres. What was missing was 1) a great idea for a service that people would be willing to buy and, of course, 2) access to the money required to deploy the new network in the hope of a Return on Investment (ROI) in a time not measured in geological ages. What better idea than millions of television programs and videos stored in gigantic servers, which subscribers could access from the comfort of their homes using the new broadband digital network? MPEG-2 was the obvious enabling technology component that would provide an effective way to store, transmit and consume television programs in digital form, using a suitable combination of fibre-based broadband technologies together with other “shortcuts” like cable or ADSL.

Before moving on, I must add the CE and IT industries to the list of prospective customers of the MPEG-2 standard. Apart from their obvious interest in making new types of television receivers obeying the wishes of their broadcasting “masters”, the CE industry was definitely interested in transforming analogue video recorders to digital form. They could even think of using some disc-based recording technology to make new types of video recorder. The IT industry, instead, was interested in the new MPEG-2-enabled infrastructure, with the expected mass deployment of IT-based infrastructure – both hardware and software – replacing traditional television equipment. Lastly, there was the growing IC industry, waiting for the new business opportunities created by MPEG-2 encoders and decoders.

Obviously, the world had not been waiting for MPEG to discover this wonderful opportunity. If nothing had happened before, it was because of the political stalemate in the CCIR on anything altering the existing television balance, the industry-specific interests of the CCITT and the IEC, and the fact that each of these bodies was populated by a single industry, whose members were more concerned with the possibility that a competitor would gain an advantage than with the prospect of all sharing (I mean, after fighting for) a new common business opportunity. 

By virtue of being new, shielded from politics, populated by all sorts of industries and application-agnostic – in addition to having shown its effectiveness in the (at that time still being developed) MPEG-1 standard – MPEG was looked at by many as the body that could deliver the holy grail of digital television in the form of a generic standard that all industries could exploit.


MPEG-2 Development – Video

The first open MPEG-2 “session” took place in Porto in July 1990. Months before, while searching for a meeting host for this ever-growing MPEG group, I had asked Prof. Artur Pimenta Alves of INESC, during a COST 211 working dinner at Ipswich, near British Telecom Laboratories, to host the July 1990 meeting. He kindly (and boldly) agreed and the meeting took place in the brand new Hotel Solverde at Espinho, a few kilometres south of Porto on the Atlantic shore. 

At that time, MPEG members were still struggling to put together the pieces of the MPEG-1 Video standard. The design of MPEG-1 Systems had not even really started and the Audio group would be meeting two weeks later in Stockholm to assess the results of the tests made on the submissions. But a diverse group of individuals found the time to attend the first MPEG-2 session, brainstorming on requirements for the “second phase of MPEG work” – as MPEG-2 was shyly called at that time. There was also some discussion of the 5 Mbit/s limit of the new work item, an anticipation of things to come. A concrete result was that the limit was moved to 10 Mbit/s – not because the group did not feel confident it could compress digital television down to 5 Mbit/s, but for another, obvious, reason.

The Porto meeting is also worth remembering for the social event offered by the host. The place was – surprise? – the Sandeman Caves. When I was making my, by then already traditional, dinner speech, some people asked me to sing “O sole mio”. My counterproposal, in deference to the host, was “Coimbra”, a song possibly of the same age. Eventually I settled for the traditional Japanese song 荒城の月 (Koujou no tsuki, Moon over the ruined castle). My speech ended with me singing the song amidst a group of Japanese delegates, Hiroshi Yasuda being one of them. Regrettably, MPEG people are still waiting for the requested performance of my supposedly native song. 

The successful kick-off of the MPEG-2 work, and the apparent neglect of such a promising future standard in the European scenario of that time, prompted me to use the COMIS project environment as a launch pad for another European project devised to foster European participation in the MPEG-2 work. The selected vehicle was the Eureka research programme. The reason was twofold: first, because no Calls for Project Proposals were forthcoming in either the RACE or the ESPRIT programs, and second, because the people who had sunk the IVICO project were still circling around undeterred. Unlike other CEC-funded R&D projects, which followed a defined program of work in response to a Call issued at a given time and were reviewed by experts appointed by the European Commission, Eureka projects could be set up on anything and proposed at any time by any consortium involving at least two companies from two different European countries. Projects had to be submitted to a committee of European government representatives, the support of two of them being sufficient for approval. The shortcoming was that funding did not come from Brussels but from national governments, each of which applied its own funding policy (sometimes meaning no funding at all). 

With an understatement, I could say that an independent Eureka proposal was not universally greeted with approval. The French government, the spearhead of the European policy of evolution of television via the analogue path, was particularly opposed to the idea. I had to go and pay a visit to a “fonctionnaire” of the Ministère de l’Industrie et de l’Aménagement du Territoire, who had the authority to vote on project approvals on the Eureka Board, to explain the case. The project was eventually approved with the title Video Audio Digital Interactive System (VADIS), or Eureka project 625 (incidentally, the project serial number coincided with the number of lines of European TV – a number not given by design!). VADIS became the channel through which a coordinated European position on MPEG-2 matters at MPEG meetings was prepared and proposals discussed and coordinated. I was appointed Project Director and Ken McCann, then of the Independent Broadcasting Authority (IBA) in the UK, was appointed Technical Coordination Committee Chair, later replaced by Nick Wells of the BBC. A Strategic Advisory Group of senior executives from the major member companies, chaired by Cesare Mossotto, then the Director General of CSELT, was also established. 

At the first meeting in Santa Clara, CA in September 1990, hosted by Apple, I met Sakae Okubo, the Chairman of the CCITT Experts Group on Video Coding. After some discussions, we agreed that he would be the best choice for chairing what was immediately christened the “Requirements” group. Some requirements activity had indeed been running since the beginning of MPEG – a very important activity, because it served the need of defining the features of a standard that would satisfy diversified industry needs. Until that time, however, there had been no opportunity to raise that activity to the right level of formality and visibility. For MPEG-2, with the wide range of aggressive interests at play, some of which I have described above, it was indispensable to formally institute the process of identifying requirements from the different application domains, lest the technical work be subjected to all sorts of random non-technical pressures. The eventual successful MPEG-2 development owes much to Okubo-san’s capabilities displayed in the 4 years of his tenure as chairman of the Requirements group. In this effort Okubo-san was well supported by a number of senior individuals such as Andrew Lippman, Associate Director of the MIT Media Lab, and Don Mead, then with Hughes Electronics, the company that would later develop and launch DirecTV in the USA. 

Having Okubo-san as an MPEG officer helped achieve another, formally very important, goal. At the time, it seemed that the future of Information and Communication Technology (ICT) would depend on Open System Interconnection (OSI), for which JTC 1 and CCITT had agreed on collaboration rules. This had to be done because OSI was a joint ISO and CCITT project and it was felt important for the two bodies to have a proper collaboration framework. Okubo-san’s double role made application of those agreements easier. So the Systems and Video parts of MPEG-2 were made “joint projects” between JTC 1 and ITU-T with the intention to publish the jointly produced standards as “common text”. This practically meant that there was going to be a single joint JTC 1 – CCITT group, including MPEG Systems and MPEG Video on the JTC 1 side, developing the Systems and Video parts of the standard. The integration of work was so deep that ISO/IEC 13818-1 (MPEG-2 Systems) and ISO/IEC 13818-2 (MPEG-2 Video), registered in ITU as H.222.0 and H.262, respectively, are the same physical documents. 

The preparation of the MPEG-2 work, in Okubo-san’s capable hands, progressed well. Hidaka-san took care of developing appropriate testing procedures for the tests on video submissions. Didier and Hans (and later Peter) were asked to start looking away from the ongoing work in MPEG-1 Video and Audio and provide input to the MPEG-2 Call for Proposals (CfP). 

Twice as many submissions (32) were received and tested in response to the MPEG-2 Video CfP at the second Kurihama meeting, kindly hosted by JVC in November 1991. Participants were also almost twice as many as two years before. Remembering the Bay Area earthquake of two years before, that week many people in the group held their breath waiting for some Force of Nature to manifest itself somewhere in the world, but nothing happened. The result of the tests provided a wealth of technical inputs, in particular for the most crucial feature required, viz. the ability to encode interlaced pictures. More features, however, were waiting for appropriate consideration. 

In the European environment, terrestrial broadcasters – a handful of them represented in the VADIS project – were facing the fact that in a couple of years there would be a digital television standard, expected to be widely supported by manufacturers. As an industry they were not necessarily opposed to digital television, but they had their own views of it. They wanted a standard represented by a “hierarchical” or “scalable” bitstream. Such a scheme would allow a full-resolution receiver to decode the full bitstream, while a lower-resolution receiver would only decode a subset of the same bitstream. It was more or less the same concept that had been pursued in Europe with D2-MAC and HD-MAC and in the USA in the first phases of the Advanced TV process, but in the digital domain. 

It is a general rule that business and politics do not carry over unaltered when crossing oceans to other continents. In 1990, General Instrument (GI) – eventually to become part of Motorola Mobility, only to be separated again, renamed ARRIS and finally to change hands once more – had changed its analogue proposal in the ATV process to a full-digital system based on a compression scheme derived from DigiCipher, which it was already using to compress video transmitted over satellite feeds to cable systems, so as to increase the number of programs transmitted per transponder. The exciting news was that the GI proposal required only one 6 MHz channel to transmit HDTV. In short order, three of the four remaining contestants in the ACATS competition (the fourth being NHK, with a scaled-down version of MUSE that could operate within 6 MHz) announced that they would also switch to digital. 

Then the FCC changed the rules and announced that broadcasters would be given a second channel to transition to digital HDTV and that, when the transition was completed, the original NTSC spectrum would be returned to the government, a step called the “digital dividend”. The four digital submissions were tested and found to perform similarly. Therefore ACATS suggested that the proponents develop a unified system in the framework of what was called the “Grand Alliance”. 

At the MPEG ad hoc group meeting in Tarrytown, NY in October 1992, hosted by IBM, the first steps were made toward the eventual confluence of the Grand Alliance into MPEG-2. Soon after, Bob Hopkins, then the Executive Director of the Advanced Television Systems Committee (ATSC) – a body with a role comparable to NTSC’s 40 years before – and a member of the HDTV Workshop Steering Committee, started attending MPEG meetings. 

The interesting conclusion was that the USA had started from a “compatible” solution only to end up opposing scalability. The request was now to have the best HDTV picture possible, because digitisation of a 6 MHz UHF channel using the 8VSB modulation selected could only provide about 20 Mbit/s. 

To add more confusion, but showing that – at least until that time – their business knew no latitudes and longitudes, telcos were generally keen to have the hierarchical feature in the standard. This feature was indeed good for some ATM scenarios where information would be transmitted in packets, some marked as high priority and carrying the lower-resolution part of the signal, and others marked as lower-priority packets carrying the high-resolution differential. In case of congestion, the former packets were expected to go through because they were set as high priority, while the latter could be discarded. Users would experience a momentary loss of resolution, but no disruption of the received programme. 
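
For readers who prefer code to prose, here is a toy sketch in Python of the behaviour just described – entirely hypothetical names, and of course not an ATM implementation: under congestion the network keeps the high-priority packets carrying the base layer and drops only the low-priority enhancement packets, so the programme survives at reduced resolution.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    priority: str   # "high" -> base layer, "low" -> enhancement layer
    payload: bytes

def transmit(cells, congested):
    """Return the cells that survive the network."""
    if not congested:
        return cells
    # Congestion: only high-priority (base layer) cells get through.
    return [c for c in cells if c.priority == "high"]

stream = [Cell("high", b"base"), Cell("low", b"enh")] * 3
survivors = transmit(stream, congested=True)
assert all(c.priority == "high" for c in survivors)  # base layer always arrives
```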

When talking of higher and lower resolution, however, broadcasters and telcos probably meant different things. The former certainly meant the HDTV and SDTV duo, while the latter more likely meant the SDTV and SIF duo, at least when talking of videoconferencing. These two views made a lot of difference in business terms, but very little difference in technical terms.

Slowly, the realisation that there were no technical reasons to put a bitrate limit on the second MPEG work item, and that the third one – HDTV – was not really needed, at least for the foreseeable future, made headway in the group. But in general, talking of HDTV was anathema to Japanese members, even though the global ramifications of their manufacturing, distribution and commercialisation made the perceived gravity of this heresy vary depending on the circumstances. 

The hurricane started gathering force at the Haifa meeting in March 1992 where a resolution was approved inviting 

National Bodies to express their stance towards the proposal made by the US National Body concerning removal of 10 Mbit/s in the title of the current phase of work

I am not particularly proud of this resolution, not because of its ultimate goal, which was great, but because technically it made no sense.  Indeed, MPEG-1 had already shown that, no matter how much reference one would make to bitrate or picture size, the standard was largely independent of them within a wide range of bitrates and picture sizes, because it was just a signal processing function that considered three 3D arrays of pixels (luminance and colour differences) as input to the encoder and output of the decoder. It was easy to trade bitrate against picture size because substantially the same algorithm could be used in different bitrate ranges with different resolutions. 

The hurricane became a tornado at the Angra dos Reis, RJ meeting in July, when Cliff Reader, then with Cypress Semiconductor and head of the US delegation since the Paris meeting in May 1991, presented a National Body position asking to fold the third MPEG work item into the second. That was one of the cases in my career when I started a meeting without knowing what would happen next – and found myself still firmly in the saddle at the end of the meeting. So MPEG-3 got killed at the 1992 Brazil meeting. 

The political problems had been solved and MPEG could proceed, but the technical issue of hierarchical vs. non-hierarchical (à la MPEG-1) coding was still waiting for a technical resolution. In the best MPEG tradition the decision had to be made on the basis of hard technical facts. No one disputed that a hierarchical solution could be clean and useful, but simulation results did not bring convincing proof that, with the technology on the table at the time, there were sufficient quality gains to justify the added complexity, and therefore the cost, arising from the hierarchical functionality, both in terms of design effort and of square mm of silicon in an IC. 

Indeed the results showed that, at a given bitrate, there was little if any gain from a hierarchical bitstream compared to so-called simulcast, i.e. two independent bitstreams whose combined bitrate is equal to that of the hierarchical bitstream. Some gain could be obtained only if the lower-resolution component used at least one half of the total bitrate of the hierarchical bitstream, but this was not very practical in most scenarios, e.g. broadcasting and ATM. Still, MPEG had pledged, in its MPEG-2 requirements document, to provide a solution to all legitimate requests coming from industry. How to manage the situation? 

In the time between the Haifa and Angra dos Reis meetings I happened to read some OSI documents that UNINFO, the Italian National Body (NB) in ISO, had kindly forwarded to me as a member of the national JTC 1 committee. I found TR 10000 particularly enlightening where it defines the notion of “profiles” as: 

sets of one or more base standards, and, where applicable, the identification of chosen classes, subsets, options and parameters of those base standards, necessary for accomplishing a particular function. 

This sounded great to my ears. By interpreting “base standards” and “chosen classes, subsets, options and parameters of those base standards” as “coding tools”, e.g. a type of prediction or quantisation, the MPEG-2 Video standard could be composed of two parts. One would be the collection of tools and the other a description of the different combinations of tools, i.e. the “profiles”. One MPEG-2 video profile could then contain basic non-hierarchical tools and another profile the hierarchical tools. 

Of course that solution would not be perfect but just good enough. Those in need of simple decoders would not be encumbered by costly hierarchical tools that they did not need and considered ineffective. Those who needed these tools did not have to design an entirely new chip because they could go to manufacturers of non-hierarchical chips and ask them to “add” the extra tools to the design of their decoder chips. Perfection would not be achieved, though, because not all MPEG-2 Video decoders would understand all MPEG-2 Video bitstreams.

So at the September 1993 meeting in Brussels, the decision was made to “adopt the following general structure for the MPEG-2 standard, 

Video part: profile approach, the complete syntax followed by the Main profile and all those profiles that WG11 will decide to specify”. 

Eventually things turned out to be less straightforward and several profiles were designed. One, called Simple Profile, did not support B-pictures (interpolation). This feature was too costly to support at the time the profile was defined (1993), because it required a lot of memory, which was considered particularly expensive by people who wanted cheap set-top boxes the next day. Then the Main Profile (MP) contained all the tools that would provide the best quality with no hierarchical features. Finally there were three profiles containing hierarchical tools: “Signal-to-Noise (SNR) scalable”, “Spatial scalable” and, lastly, the “High” profile, which contained the collection of all tools. 

The really nice side of this story is that these profiles are all hierarchical, in the sense that if “>” is used to mean “is a superset of” one can say that the MPEG-2 Video profiles obey the following relationship: 

High > Spatial Scalable > SNR scalable > Main > Simple

This can be seen from the figure below (see later for the two profiles at the bottom).

MPEG-2_profilesr

Figure 1 – MPEG-2 profile architecture (the 4:2:2 and Multiview profiles were specified later)

Profiles, however, solved just one part of the problem, the “functionality” part, e.g. being scalable or not. What remained to be solved was the need to quantise the “resource” scale, i.e. mostly bitrate and picture size, in some meaningful, application-dependent fashion. The adoption of “Levels” helped solve this problem. The lowest level was called, indeed, Low Level and corresponds to SIF; Main Level (ML) corresponds to Standard Definition TV; “High1440” and “High” correspond to HDTV with 1440 and 1920 samples/line, respectively. 
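
To make the Profile/Level machinery concrete, here is a minimal sketch in Python. The tool names are illustrative stand-ins, and the level figures are the commonly quoted Main Profile bounds; treat neither as a quotation from the standard.

```python
PROFILES = {
    "Simple":           {"I", "P"},                          # no B-pictures
    "Main":             {"I", "P", "B"},
    "SNR Scalable":     {"I", "P", "B", "snr_scal"},
    "Spatial Scalable": {"I", "P", "B", "snr_scal", "spatial_scal"},
    "High":             {"I", "P", "B", "snr_scal", "spatial_scal", "4:2:2"},
}

LEVELS = {  # (max width, max height, rough max bitrate in Mbit/s)
    "Low":      (352, 288, 4),
    "Main":     (720, 576, 15),
    "High1440": (1440, 1152, 60),
    "High":     (1920, 1152, 80),
}

# The profiles form a strict chain, where ">" means "is a superset of":
# High > Spatial Scalable > SNR Scalable > Main > Simple.
order = ["Simple", "Main", "SNR Scalable", "Spatial Scalable", "High"]
for lower, higher in zip(order, order[1:]):
    assert PROFILES[higher] > PROFILES[lower]

def conforms(profile, level, tools_used, width, height, mbit_s):
    """Does a bitstream fit a given Profile@Level (e.g. MP@ML)?"""
    max_w, max_h, max_rate = LEVELS[level]
    return (tools_used <= PROFILES[profile]
            and width <= max_w and height <= max_h and mbit_s <= max_rate)

print(conforms("Main", "Main", {"I", "P", "B"}, 720, 576, 6))  # True
```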

This partitioning allowed MPEG to get rid of some otherwise intractable political issues and also provided a practical way of partitioning services and receivers in a meaningful way. In retrospect one could say that the MPEG-2 Profiles and Levels were just the formalisation of two solutions already adopted in MPEG-1. Indeed, the Layers of MPEG-1 Audio correspond to Profiles (with the same meaning of “>” as above, Layer III > Layer II > Layer I) and the Constrained Parameter Set (CPS) is the only Level defined for MPEG-1 Video. 

The performance of MPEG-2 Video was assessed with the Verification Test (VT) procedure already applied to MPEG-1 Audio. The tests were conducted in the following way:

  1. A set of test sequences was agreed on and distributed to those participating in the tests
  2. Each participant used their own proprietary encoders to encode the test sequences and delivered the corresponding bitstreams
  3. Encoded bitstreams were decoded using a standard decoder
  4. Decoded pictures were subjectively tested. 

It turned out that at 6 Mbit/s the quality of encoded pictures was subjectively equivalent to the quality of composite television (PAL and NTSC) in the studio. At 9 Mbit/s the equivalence was with component (RGB) television in the studio, an interesting result if one considers that MPEG-2 MP does not support 4:2:2 sampling but only 4:2:0 (i.e. the colour-difference signals are subsampled by a factor of 2 both horizontally and vertically), a consequence of the fact that humans have lower visual acuity for colour than for luminance.
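
The sampling formats are easy to quantify. A back-of-the-envelope computation (plain arithmetic, not data from the tests) shows what 4:2:0 saves over 4:2:2 for a standard-definition frame:

```python
w, h = 720, 576
luma = w * h                      # one luminance sample per pixel

# 4:2:2 halves the chroma horizontally only; 4:2:0 halves it both ways.
chroma_422 = 2 * (w // 2) * h     # two colour-difference components
chroma_420 = 2 * (w // 2) * (h // 2)

print(luma + chroma_422)  # 829440 samples/frame
print(luma + chroma_420)  # 622080 samples/frame -> 25% fewer overall
```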

The existence of the sophisticated MPEG-2 Video technology triggered the interest of the professional video industry. The proposal to develop what was eventually called the “4:2:2 profile” came from a group of video studio product companies that requested the development of a new profile that would be capable of dealing with video in the native 4:2:2 resolution in compressed form at bitrates of some tens of Mbit/s. The idea was that the technologically most expensive part – the encoder/decoder function – would become much less expensive if it could exploit mass production of MPEG-2 Video chips. 

A second proposal was less successful. So far all the work in video coding had been done under the assumption that A/D conversion was performed with a linear 256-level (8-bit) quantisation scale. Indeed this level of quantisation was more than adequate for end-user applications, because the eye cannot resolve more than 256 levels. However, if some picture processing is performed in the digital domain, this accuracy is easily lost and, because of visible quantisation errors, ugly-looking pictures are generated when the final samples are produced. To avoid this, it is necessary to start with a higher number of bits/pixel, say 10 or 12. MPEG-2 Video had been designed for 8-bit operation and its extension to 10 bits was not straightforward. A new MPEG-2 part (part 8) was started, but soon the industry’s interest in this subject waned and the work came to a halt. So, don’t look for MPEG-2 part 8, even though there are more MPEG-2 parts starting from part 9. 

Another extension was the so-called Multiview profile. The idea behind this work was that it is possible to create elements of a 3D image from two views of the same image taken from two spatially separated video cameras pointing at the same scene. If the separation between the two cameras is small, the two pictures are only slightly different and the differences can be coded with very much the same type of technology as used in interframe coding, i.e. motion compensation. This was only the first instance of an activity that would continue with other video coding standards that sought to provide the “stereo” view experience when watching television. 


MPEG-2 Development – Audio

As with MPEG-1, the Audio work in MPEG-2 took a different turn from its original direction. MPEG-1 Audio already provided an excellent way to compress stereo audio, exactly what many broadcasters were thinking of providing as a first step in their soon-to-come digital services. But some expected that the future would lie in a further enhancement of the user experience, provided by multichannel audio services. For a Service Provider, it made a lot of sense to start with MPEG-1 stereo sound and upgrade it later to a multichannel audio service that could still be received by the existing population of MPEG-1 Audio receivers, even though the latter would continue to get only stereophonic, not multichannel, sound. This was the same argument that was made by the people who wanted a scalable MPEG-2 Video. Why was the audio argument accepted and not the apparently similar video argument? 

The answer to this question has many facets. On the one hand, there was a matter of personalities involved in the discussions in the two groups. On the other, there was the obvious consideration that the video part of a program would require, in general, one order of magnitude more bits than the audio part and, therefore, a slight inefficiency in the use of the total program bitrate for the audio could be tolerated, while for video inefficiency would come at too high a price to pay. 

At the Haifa meeting the decision was made to adopt the requirement that MPEG-2 Audio be backward compatible with MPEG-1 Audio. This requirement seemed to considerably restrict the range of technologies that could be submitted in response to the MPEG-2 Audio CfP. Still 10 submissions were received in response to the call. 

After a while, the Audio group began to feel uneasy because some felt that, by working exclusively on a backward-compatible solution – justified, as shown before, for digital television services – MPEG was excluding pure audio solutions, where the excellence of the standard was going to be judged exclusively on the grounds of the highest audio quality at the lowest bitrate. This issue was raised by a US NB contribution submitted to the July 1993 meeting in New York, hosted by Columbia University. So the decision was made that, when carrying out the MPEG-2 Audio Verification Tests on the Backward Compatible (BC) solution, MPEG would also use yet-to-be-identified Non-Backward Compatible (NBC) codecs in order to assess the improved performance that could be obtained with an unconstrained algorithm. If the tests showed that the backward-compatibility constraint introduced too heavy a compression penalty, MPEG would initiate the development of a new, NBC multichannel audio coding standard. 

I personally liked (and continued to do so in the following years) the idea of creating an internal competition between what was bound to be two groups of people working on different technologies because competition could only improve the performance of both the BC and NBC multichannel audio coding solutions. 

At the same meeting I was involved in an unusual case. One evening, past midnight, I was working with a group of MPEG members in a room at Columbia University that was hosting the MPEG meeting. Tristan Savatier, then with Thomson Consumer Electronics, Los Angeles and a very active member of the Video group, felt the need for a cup of coffee and went out to get one but found the kitchen door locked. He worked on the lock, got in the kitchen and had his cup of coffee but was caught red-handed by the night security. Of course his intentions became known to me only after the fact. I had to take responsibility for Tristan’s future actions – for that evening, I mean, not forever – or I would have lost his work that night.

Unexpectedly, at the Paris meeting in March 1994, the US NB requested that MPEG endorse a specific proprietary multichannel audio coding solution as one element of the MPEG-2 Audio standard family. My reaction at the mid-week plenary – that this was not in line with the MPEG policy of having major standards developed within the group – was greeted with whistles of disapproval on the part of some members. The Friday plenary saw a rather lengthy monologue of mine, interrupted by a few exchanges of words with some MPEG members. This was a baptism of fire for Peter Schirling of IBM, who had just been appointed head of the US delegation, as Cliff Reader had left that position one year before at the Sydney meeting and had been replaced by Greg Wallace, then with 3DO, who had left that position the meeting before. The meeting ended with a confirmation of the MPEG policy, which held until the end of MPEG. 

One MPEG member recorded this monologue on an audio cassette (current ISO rules would not allow this) and, subsequently, Tristan Savatier got a copy of the tape, converted it to MPEG-1 Audio Layer II and posted it on a web site. The posting was structured in a way that looked like a soloist performance, with titles created from the more interesting (for him) passages of my monologue, much as in an Italian opera. My reaction to this initiative was that, since I had not released the copyright of my “performance”, the posting was illegal and should be removed (call it my version of “cease and desist”). This request of mine, however, was met by a shrug of the shoulders (virtual, as this happened by email). Therefore I can probably claim to have been the target of the first example of an unauthorised posting of a “performance” (not musical, I agree, but a performance it still was) on the web. True, the coding technology used was still MPEG-1 Audio Layer II and not the eventually more famous Layer III, but that was just a proof of how hard people were working to bring MPEG technologies to a mass market. 

The work on what would eventually be called Advanced Audio Coding (AAC) followed the usual steps of requirements definition, CfP and collaborative development. Marina Bosi, then with Dolby Labs, was appointed as its editor. While returning to the hotel one evening during the AES Convention in New York, she was badly hit by a taxi and had to undergo several surgeries before recovering. The group deeply appreciated the determination with which she carried out her duties in terrible personal circumstances that would have crushed the resistance of many. When Marina was reporting the completion of the AAC work at the Bristol meeting in April 1997, before the final approval, I asked her what she was still using her walking stick for (and she still badly needed it at that time). She then defiantly set it aside and, standing, completed her report. 

The Verification Tests (VT) showed that subjective transparency was achieved at 128 kbit/s, a 50% gain over MPEG-1 Audio Layer II! As with MP3, the best AAC encoders today can provide even better performance. The VT also confirmed that the original target of “indistinguishable” audio quality at 384 kbit/s for five full-bandwidth channels was achieved and exceeded: tests carried out by the BBC and NHK showed that 320 kbit/s were sufficient to achieve the target.


MPEG-2 Development – Systems

The development of the MPEG-2 Audio and Video standards required the best experts in digital audio and video processing, but the development of the Systems part required seasoned engineers, a species in scarce supply today. Because of that, we might no longer be able to access their unique expertise (and not because of their untimely departure, but because they are spending their days on some exotic beaches). The lucky side for MPEG at that time was that there were plenty of them – and very good ones – because so many companies were waiting for a solution to make products or offer services. 

The MPEG-2 media coding parts of the standard had been designed to be “generic” (hence the title eventually given to MPEG-2: “Generic coding of moving pictures and associated audio”). Gathering the requirements that the application domains placed on the Systems part – the interface with those domains – was therefore a huge task that was again assigned to the Requirements group. 

A major consideration was that digital television would be carried by delivery systems that were mostly analogue, typically Hertzian channels and CATV. Different industries and countries had plans to develop solutions to digitise them with appropriate modulation schemes. So MPEG could assume that digitisation would “happen” (as in fact it did, albeit in a very disorderly and non-uniform fashion), but there were a number of functionalities between the media coding and the physical layer, such as the multiplexing of different television programs, that were roughly equivalent to an OSI “transport layer” and that were not going to be provided by the modulation schemes. 

A brand new “systems” layer was needed, with completely different requirements from those that had led to the definition of MPEG-1 Systems. The MPEG-1 Systems layer had adopted a packet multiplexer, which I consider a great achievement (and, as I said, a personal technical vindication). This had happened thanks to the positive interaction between a group of IT-prone members and other open-minded groups of telco and Consumer Electronics members. That this outcome was not a foregone conclusion can be seen from the case of DAB, a service that uses MPEG-1 Audio but a traditional frame-based solution instead of the MPEG-1 Systems layer. The reasons are that the MPEG-1 Systems layer does not provide support for adaptation to the physical layer (e.g., it assumed an error-free environment, hardly a valid assumption in a radio channel) but, more importantly, that a packet-based multiplex was anathema to Audio engineers at that time. 

In the digital television domain we were talking, if not of the same engineers, of people with a similar cultural background, so the packet-based vs frame-based argument popped up again. Eventually the decision was made to adopt a fixed-length packet-based multiplexer, a choice that somehow accommodated both views of the world (what is the conceptual difference between a frame and a fixed-length packet?). This, however, only solved one half of the problem, because a multiplex laden with features designed to support transmission in a hostile environment was not the best choice for the storage applications of that time. A second solution was required, quite similar to that of the MPEG-1 Systems standard. 

The first definition of the MPEG-2 Systems layer was achieved at the Sydney meeting in March/April 1993, where it was recognised that a single solution encompassing both application domains was not feasible, at least in the very tight timeline of the project. Therefore the systems layer was defined as having two forms, one called Transport Stream (TS) and the other Program Stream (PS). 

There is no time for regrets now, but I am still consumed by my failure to bring together all the industries that had an interest in a “transport solution for real-time media”. Granted, reconciling so many conflicting requirements would have been challenging, but now the PS and TS basically have no common root other than the rather evanescent Packetised Elementary Streams (PES). As a result, the industries in need of a TS or PS solution went away with their part of the booty, while the telcos looked disdainfully from a distance at the TS/PS debate without even trying to join the discussion, lost as they were in their ATM Adaptation Layer (AAL) dispute of AAL1/AAL2 vs. AAL5. My regret is augmented by the fact that MPEG did have enlightened and competent people who could have provided the unifying solution, instead of the unnatural solution – designed for non-real-time data on the network – that was forced down the throats of us media people for real-time media. 

The request that the US National Body had made in Paris about a non-MPEG audio codec had been rejected, but the reasons that had prompted it remained unchanged. Indeed the USA, with their ATV project, were moving ahead with plans to deploy their terrestrial digital television system (which they did in 1997) and they wanted to use MPEG-2 Systems and Video but use a non-MPEG audio codec. How was it possible for them to do so if the system did not recognise a non-MPEG audio bitstream? 

The problem was solved by establishing a Registration Authority (RA), a standard ISO mechanism to cater for an evolving standard that needs the addition of references without following the rather cumbersome process of Amendments or new Editions. Those who wanted to have their proprietary streams carried by the MPEG-2 Systems layer would register those streams with the RA, which would then assign a registration number to be carried in an appropriate field of the bitstream. The Society of Motion Picture and Television Engineers (SMPTE) was eventually appointed by ISO as the RA for this so-called “format identifier”. 

With the same mechanism, it was possible to accept a request made at the Singapore meeting in November 1994 by the Confédération Internationale des Sociétés d’Auteurs et Compositeurs (CISAC), the international confederation of societies of authors and composers. The request was to provide the means to signal copyright information regarding the video stream, the audio stream and the audio-visual stream in an MPEG-2 stream. The so-called “copyright identifier” solved the problem with a two-field number where the first field identifies the agency managing the rights to the stream and the second field carries the identifier assigned by that agency to the specific content item. Again, the solution requires an RA where agencies can go to get their identifiers. 
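
Both identifiers travel in ordinary MPEG-2 descriptors: one byte of tag, one byte of length, then the payload. The sketch below (Python) uses the tag values commonly associated with the registration and copyright descriptors in ISO/IEC 13818-1; the payloads are made-up examples.

```python
import struct

def descriptor(tag: int, payload: bytes) -> bytes:
    """Generic MPEG-2 descriptor: tag, length, payload."""
    return struct.pack("BB", tag, len(payload)) + payload

# Registration descriptor (tag 0x05): a 32-bit format_identifier assigned
# by the RA (SMPTE), e.g. "AC-3" for the non-MPEG audio codec used by ATSC.
registration = descriptor(0x05, b"AC-3")

# Copyright descriptor (tag 0x0D): a 32-bit identifier of the rights
# agency, followed by the agency-assigned identifier of the content item
# (both values below are illustrative).
copyright_desc = descriptor(0x0D, b"AGCY" + b"item-0001")

print(registration.hex(), copyright_desc.hex())
```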

Another, very important, component was added to MPEG-2 Systems. This was in response to the request from pay TV operators to provide an infrastructure on top of which proprietary protection schemes could be implemented. The addition of two special messages solved the problem: Entitlement Control Messages (ECM) and Entitlement Management Messages (EMM). More about this later.

All that has been described so far was sufficient for the particular, though very important, Over-The-Air (OTA), satellite and cable broadcasting constituencies, but not for those – the telecommunication and CATV industries – which employed physical delivery media. To stay in, or to move into, the business of digital television competitively, these industries needed a standard protocol to set up a channel with the remote device and to let a receiver interact with content stored at the source. The Digital Storage Media (DSM) group provided the home for this important piece of work. 

An incredibly active group of people started gathering under the chairmanship first of Tom Lookabaugh and later of Chris Adams, both of DiviCom, to develop the Digital Storage Media Command and Control (DSM-CC) standard that became part 6 of MPEG-2. In the best MPEG tradition, MPEG developed a completely generic standard. So, even if the DSM, telco and CATV industries had triggered the work, the final protocol is generic in the sense that it can be used both when a return channel exists and when the channel is unidirectional. In the latter case the transmitter can use a carousel, but the receiver is presented with a single interface. Ironically, because the Video on Demand (VOD) business did not fare as expected, the carousel part of the DSM-CC standard is widely used in broadcast applications. 

The last major component of MPEG-2 is the so-called Real-Time Interface (RTI). This was developed because the MPEG-2 Systems specification assumes that packets arrive at the decoder with zero jitter, clearly an idealised assumption that holds reasonably well in most OTA broadcast, satellite and CATV environments, but is not a valid assumption for such packet-based networks as ATM and Internet Protocol (IP). The purpose of part 9 of MPEG-2 is then to provide a specification for the level of jitter that an implementation is required to withstand. 
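
In essence, the specification bounds how far real arrival times may stray from the ideal, zero-jitter schedule. A toy check of that property (illustrative numbers and names, not values from part 9) could look like this:

```python
def within_jitter_bound(ideal_ms, actual_ms, bound_ms):
    """True if every packet arrives within +/- bound_ms of its ideal time."""
    return all(abs(a - i) <= bound_ms for i, a in zip(ideal_ms, actual_ms))

ideal  = [0.0, 1.0, 2.0, 3.0]   # zero-jitter schedule assumed by part 1
actual = [0.1, 0.9, 2.2, 3.0]   # what a real network delivers
print(within_jitter_bound(ideal, actual, bound_ms=0.25))  # True
```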

The MPEG-2 Systems, Video and Audio Committee Draft (CD), the first stage of the standard issued for ballot, was approved in Seoul in November 1993, after one of the most intensive weeks in MPEG history. Some delegates worked until 6 am on Friday to produce the three Systems, Video, and Audio drafts so that they could be photocopied and distributed to all members for approval at the afternoon plenary. It was at that meeting that the mark of one million photocopied pages was reached. 

The short night did not prevent Tristan Savatier from staging another of his tricks. He convinced one of the lady delegates to lend him her stockings and shoes and, during the coffee break of the Friday afternoon plenary, he hid under my desk, the stockings pulled over his hands and arms and the shoes held in his hands. When I resumed the meeting he started showing the stockings and the shoes from below the desk, as if they were mine. 

The following Paris meeting allowed people to review the work done at the intense Seoul meeting. The Systems part was found to need a major overhaul, and so it was decided that a special meeting would be held in June in Atlanta, GA, hosted by Scientific Atlanta, just before the regular July meeting. With this additional effort MPEG finally approved the standard in Singapore in November 1994, as planned. 

I would like to conclude this chapter by reporting what VADIS did to promote the development of MPEG-2, and specifically CSELT’s role in it. Besides active participation in tens of CEs, VADIS carried out a thorough campaign of field trials to assess the performance of the MPEG-2 standard. Some VADIS members produced audio-visual bitstreams, others made available transmission adaptors, like one of the first modems for satellite, cable and terrestrial UHF, and still others made available their ATM networks. CSELT had continued working on its multiprocessor architecture (the third generation, using an Intel 860 RISC instead of the original 80186 and five 2901 DSPs per board) and produced two real-time MPEG-2 decoders. The two decoders were still in regular use in my lab when I left in 2003. Another achievement of the project was the support given to the development of the VLSI design of an MPEG-2 Video decoder, which enabled Philips to become the 4th worldwide supplier of such chips, before it decided to spin off its semiconductor division as NXP. 


Inside MPEG-2

Unlike MPEG-1, where a reference model strictly needs to consider only the decoder, the full extent of the MPEG-2 standard requires consideration of the complete chain from source to destination because of the DSM-CC part. The figure below gives a schematic representation of the scope of the MPEG-2 standard. The Source, however, is still not part of the standard.

MPEG-2_decoder_model

Figure 1 – Model of the MPEG-2 standard

The Demultiplexer, Video and Audio decoding blocks correspond to parts 1, 2 and 3 (or 7) of ISO/IEC 13818. The DSM-CC block, not present in the MPEG-1 model, is specified in part 6. Additionally, part 9 provides the jitter specification of the interface between the receiver and the delivery system. Part 4 is conformance testing for parts 1, 2, 3 and 7. Similarly, part 5 is the Reference Software for parts 1, 2, 3 and 7 (no reference software exists for part 6). Part 10 is conformance testing of DSM-CC. Part 11 was added later, providing the specification of an extension of the conditional access functionality.

MPEG-2 Systems is a great piece of engineering work, designed to satisfy a broader range of multi-industry requirements than any group had ever tried to deal with in a standards committee before. The figure below illustrates the basic multiplexing approach for a television program composed of a single video stream and a single audio stream. 

PS-TS

Figure 2 – Model of MPEG-2 Systems

The video and audio data are encoded according to MPEG-2 Video and MPEG-2 Audio or AAC, and systems level information is added to the resulting compressed streams. These streams in packetised form are called Packetised Elementary Streams (PES). The PESs can then be further combined to form either a PS or a TS. 

The PS results from combining one or more PESs, all having a common time base, into a single stream, much like the MPEG-1 Systems multiplex. The PS is designed for use in the same relatively error-free environments and is suitable for software processing applications. PS packets may be of variable, and relatively large, length.

The TS combines one or more PESs with the same or different time bases into a single stream, a requirement for the carriage of more than one TV program, as different programs have in general been produced with independent clocks. The TS is designed for use in environments where errors are likely, such as storage or transmission in lossy or noisy media. TS packets are 188 bytes long. This choice was made, for lack of better reasons, because at the time it was thought that having a packet length related to the ATM Adaptation Layer 1 (AAL1) packet length (188 = 4×47), AAL1 being expected to be used for real-time video transmission, would be an advantage. 

MPEG-2_TS

Figure 3 – Structure of an MPEG-2 TS packet
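
To give a feel for the structure in the figure, here is a minimal Python parser for the fixed 4-byte header of a TS packet. The field layout follows ISO/IEC 13818-1, but this is a didactic sketch, not a production demultiplexer.

```python
def parse_ts_header(packet: bytes) -> dict:
    assert len(packet) == 188 and packet[0] == 0x47  # sync byte
    b1, b2, b3 = packet[1], packet[2], packet[3]
    return {
        "transport_error":    bool(b1 & 0x80),
        "payload_unit_start": bool(b1 & 0x40),
        "transport_priority": bool(b1 & 0x20),
        "pid":                ((b1 & 0x1F) << 8) | b2,  # 13-bit stream id
        "scrambling":         (b3 >> 6) & 0x03,
        "adaptation_field":   (b3 >> 4) & 0x03,
        "continuity_counter": b3 & 0x0F,
    }

pkt = bytes([0x47, 0x41, 0x00, 0x10]) + bytes(184)  # PID 0x100, payload only
print(parse_ts_header(pkt)["pid"])  # 256
```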

Part 2 extends MPEG-1 by adding several new tools. The most important ones are those that encode interlaced video. When video is already in progressive form, MPEG-2 Video coding falls back to MPEG-1 Video. Exploitation of interlace information provides an improvement of compression efficiency of about 20%. MPEG-2 Video also provides tools for various types of scalability, namely SNR and spatial scalability. SNR scalability is a functionality that is provided by multiple layers such that an enhancement layer carries DCT coefficients quantised with improved accuracy. In spatial scalability there is a basic layer and an enhancement layer. The latter carries information that can be used to improve the spatial definition of the picture of the basic layer. 
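
A conceptual sketch of what spatial scalability buys (an illustration of the idea, not the normative decoding process, which uses a proper interpolation filter rather than the crude upsampling below):

```python
import numpy as np

def decode_spatial_scalable(base_layer, enhancement_residual):
    # Nearest-neighbour upsampling stands in for the standard's
    # more sophisticated interpolation.
    upsampled = np.kron(base_layer, np.ones((2, 2), dtype=base_layer.dtype))
    return upsampled + enhancement_residual

base = np.full((288, 360), 128, dtype=np.int16)   # lower-resolution base layer
residual = np.zeros((576, 720), dtype=np.int16)   # detail added by enhancement
full_res = decode_spatial_scalable(base, residual)
print(full_res.shape)  # (576, 720)
```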

Part 3 extends MPEG-1 Audio from the stereo to the multichannel case while preserving backwards compatibility with MPEG-1 Audio. This means that an MPEG-1 Audio decoder is able to extract the stereo part from an MPEG-2 Audio bitstream. The opposite is also true: an MPEG-2 Audio decoder is capable of decoding an MPEG-1 Audio bitstream. MPEG-2 Audio also provides an extension of MPEG-1 Audio to lower sampling frequencies.

Parts 4 and 5 correspond to those of MPEG-1: Conformance and Software Simulation. 

Part 6 has the title “Digital Storage Media Command and Control (DSM-CC)” and supplies protocols to establish audio-visual sessions on heterogeneous networks (this is the User-to-Network part of the standard) and to control audio-visual streams in both interactive and broadcast environments (this is the User-to-User part of the standard). For interactive environments, the DSM-CC User-to-User protocol allows clients to access a collection of distributed objects (such as files, directories and streams) located on remote servers. For broadcast environments, the DSM-CC User-to-User Object Carousel protocol allows simultaneous access to broadcast objects by various clients (see Fig. 4). 

dsm-cc

Figure 4 – The DSM-CC model

In the DSM-CC model, a stream is sourced by a Server and delivered to a Client, both considered to be Users of the DSM-CC network. Additionally DSM-CC defines a logical entity called the Session and Resource Manager (SRM) which provides a (logically) centralised management of the DSM-CC Sessions and Resources.
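
The broadcast case can be caricatured in a few lines of Python: the server endlessly rebroadcasts its objects and a client simply waits for the one it needs to come around. All names here are hypothetical, and the real Object Carousel protocol is of course considerably richer.

```python
import itertools

def carousel(objects):
    """Server side: endlessly cycle through the broadcast objects."""
    return itertools.cycle(objects.items())

def fetch(broadcast, wanted):
    """Client side: consume the stream until the wanted object appears."""
    for name, data in broadcast:
        if name == wanted:
            return data

objs = {"menu": b"...", "epg": b"...", "logo": b"..."}
print(fetch(carousel(objs), "epg"))  # b'...'
```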

Part 7 has the title “Advanced Audio Coding (AAC)” and supplies an alternative way, not backward compatible with MPEG-1 Audio, to encode stereo and multichannel audio. 

AAC achieves coding gain primarily through three strategies (a toy sketch of the resulting pipeline follows the list):

  1. It removes redundancy based on purely statistical properties of the signal by means of a high-resolution transform (1024 frequency bins) 
  2. It reduces irrelevancy (the removal of information that cannot be perceived) by determining a threshold for the perception of quantisation noise, based on a continuously signal-adaptive model of the human auditory system
  3. It uses entropy coding to match the actual entropy of the quantised values with the entropy of their representation.
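
The three strategies chain into a pipeline that, grossly simplified and with stand-in functions throughout (AAC really uses an MDCT, an elaborate psychoacoustic model and Huffman codebooks), looks like this:

```python
import numpy as np

def encode_frame(pcm_block):
    # 1. High-resolution transform removes statistical redundancy
    #    (a windowed FFT stands in for AAC's 1024-line MDCT).
    coeffs = np.fft.rfft(pcm_block * np.hanning(len(pcm_block))).real

    # 2. Irrelevancy reduction: a (toy) psychoacoustic threshold decides
    #    how coarsely each coefficient may be quantised without the
    #    resulting noise being perceived.
    thresholds = np.maximum(np.abs(coeffs) * 0.01, 1e-3)
    quantised = np.round(coeffs / thresholds)

    # 3. Entropy coding matches bit cost to the statistics of the
    #    quantised values (a plain byte dump stands in for Huffman coding).
    return quantised.astype(np.int16).tobytes()

payload = encode_frame(np.random.randn(2048))
print(len(payload))
```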

AAC also provides tools for the joint coding of stereo signals and other coding tools for special classes of signals. Fig. 5 is an AAC encoder block diagram, in which the modules providing the primary coding gain are highlighted.

AAC Encoder

Figure 5 – AAC encoder

AAC has three profiles: Main Profile, Low Complexity (LC) Profile and Scalable Sampling Rate (SSR) Profile. The Main Profile provides the best quality but is more complex than the LC Profile. The SSR Profile has lower complexity than the Main and LC Profiles. Additionally, the SSR Profile can provide a frequency-scalable signal.

Part 9 is titled “Real Time Interface for System Decoders (RTI)” and specifies the real-time interface to Transport Stream decoders, which may be utilised for adaptation to all appropriate networks carrying Transport Streams (see figure below).

RTI_model

Fig. 6 – RTI model

Part 10 is the conformance testing for DSM-CC. 

Part 11, approved in 2003, extends the functionality of content protection. This will be described in more detail later. 

A diligent reader who has reached this point might like to know more technical details about the MPEG-2 standard. Such curiosity can be satisfied by the MPEG-2 resource page of mpeg.chiariglione.org, which acted as _the_ MPEG group’s web site until the group’s dissolution in the fall of 2020, brought about by clear forces urged by obscure forces.