Riding the Media Bits

Last update: 2011/08/21

Inside MPEG-1

 

An overview of the technical content of MPEG-1.


MPEG-1 is formally known as ISO/IEC 11172. ISO/IEC refers to the fact that the standard is "owned" by both ISO and IEC because JTC 1, the Technical Committee under which MPEG operates, is a joint ISO and IEC Technical Committee. The number "11172" is a 5-digit serial number that the ISO Secretariat assigns to identify a new standards project and that follows the project throughout its life cycle. 

The title of MPEG-1 is rather convoluted: "Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s". Interestingly, the reference bitrate of 1.5 Mbit/s (the comma in the 1,5 of the title follows ISO conventions), which was the original driver of the work, appears only in the title and is referenced nowhere in the text of the standard. More about this later on this page. 

A major departure from other similar standards, such as those produced by the ITU-T in the audio coding domain, is that MPEG-1 does not define an end-to-end "transmission system"; it only defines the receiving end of it. Actually, it does not even do that, because the standard only provides the "information representation" component of the audio-video combination, in other words the format of the information entering the receiver. This feature is common to most MPEG standards.

Philosophically this is quite pleasing. It implements a principle that should drive all communication standards: define only what is needed to understand the message, not the way the message is constructed. In other words, a communication standard (I would submit that this should be true for all standards) should specify the bare minimum that is needed for interoperability. Actually, a standard that does not address interoperability is meaningless. When I hear members of some standards committees saying: "from now on we will work on interoperability", I always wonder what they have been doing until then. On the other hand it is also true that a standard that overspecifies the domain of its applicability may do more harm than good.

Having MPEG-1 (and all other following standards) written in a decoder-centric way has been a personal fulfillment. Indeed, back in 1979 I submitted to a COST 211 meeting a version of the H.120 Recommendation re-written with a decoder-centric viewpoint: describe what you do with the serial bits entering the decoder and how they are converted into pixels, as opposed to the encoder-centric viewpoint of the Recommendation: describe what you do with video and audio pixels entering the encoder and how they are converted into serial bits on the output wire. Further, the MPEG-1 standard does not say anything about how the message is actually carried by a delivery system. It only specifies some general characteristics of it, the most important of which is to be error-free. 

This contrasts with the approach that was followed by other, industry-specific, communication standards like MAC, developed by EBU in the 1980s. Indeed, the MAC standard is a definition of a complete transmission system where everything is defined, from the 11 GHz channel of the satellite link down to the data multiplex, the video signal characteristics, and the digital audio and character coding. I say "down" because this is broadcasting, and the physical layer is "up". 

To be fair, I am comparing two systems designed with very different philosophies in mind. MAC is a system designed by an industry - European broadcasters - writing the specification for a system to be used by their members, while MPEG-1 is a specification written for use by a multiplicity of industries, some of which MPEG did not even anticipate at the time it developed the standard. Even so, I think it is always helpful, whenever possible, to write technical specifications in a way that makes it possible to isolate different subsystems and define interfaces between them. It makes the design cleaner, it facilitates reusability of components and it creates an ecosystem of competing suppliers from system level down to component level. The last point is the reason why the open interface approach to standards is not favoured by some industries, say telco manufacturers - and the reason why IT products are making inroads in their business. MPEG-1 is another departure from this traditional view of standards as monoliths. 

Even though the three parts of ISO/IEC 11172 are bound together ("one and trine", as MPEG-1 was called), users do not need to use all three of them. Indeed, in ISO, a "part" of a standard is itself a standard, so it is possible to only use the Systems part and attach proprietary audio and video codecs. Not that this is encouraged, but this "componentisation" approach extends acceptance of the standard and lets more manufacturers compete. On the other hand it is clear that some customers may not need or want to have the individual pieces of the standard (providing an interface costs more than providing no interface at all) and in that case they can order a single system from one supplier.

The figure below is a simple description of the components of the MPEG-1 standard. 

 

Fig. 1 - Reference model of MPEG-1

 

Serial bits arrive at the MPEG-1 decoder from a delivery medium-specific decoder (e.g. the pick up head of a CD player) and are interpreted by the Systems decoder. This passes on "video bits" to the Video decoder and "audio bits" to the Audio decoder. 

The Systems decoder processes the other non-audio and non-video bits contained in the bitstream, i.e. those carrying timing and synchronisation information, and the result of the processing is handed over to the Video and Audio decoders. It is to be noted that MPEG-1 is capable of handling an arbitrary number of compressed video and audio streams with the constraint that these must all have the same time base. 

There are two main pieces of timing information that are passed on to the Systems decoder. The first is the so-called Decoding Time Stamp (DTS) that tells a decoder when to decode the video or audio information that has been so stamped. The second is the so-called Presentation Time Stamp (PTS) that tells the video and audio decoders when to present (i.e. display, in the case of video) the video or audio information that has been so stamped. In this way an MPEG-1 stream is a self-contained piece of multimedia information that can be played back without the need for a lower-layer entity, such as a transport. 

MPEG-1 Systems specifies how bitstreams of compressed audio and video data are combined. A packet-based multiplexer serialises the audio and video streams and keeps them synchronised. MPEG-1 Systems assumes that the reference time base is provided by a System Time Clock (STC) operating at 90 kHz (=1/300 of the 27 MHz sampling frequency used for digital video). STC values are represented with 33-bit precision and incremented at a 90 kHz rate. The bitstream carries its own timing information in the System Clock Reference (SCR) fields. PTSs, also represented with 33-bit precision, give the time at which the author expects the audio or video information to be presented. Note that MPEG-1 does not say anything about the use that will be made of audio or video samples, because MPEG-1 does not address "presentation" of decoded samples. The processing of an MPEG-1 bitstream requires a buffer. Therefore MPEG-1 Systems utilises a System Target Decoder (STD) model and DTSs. The latter are required because MPEG-1 Video makes use of B-pictures, which require reordering of information at the decoder side.
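The clock arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: the 90 kHz rate, the 27 MHz figure and the 33-bit width come from the text, while the constant and function names are made up for illustration and are not taken from the standard.

```python
# Illustrative names (assumptions); only the 90 kHz rate, the 27 MHz figure
# and the 33-bit field width come from the text.
CLOCK_HZ = 90_000             # rate of the time-stamp clock
VIDEO_CLOCK_HZ = 27_000_000   # digital video sampling clock
STAMP_BITS = 33               # width of a time stamp

def seconds_to_ticks(seconds: float) -> int:
    """Convert a time in seconds to a 33-bit time stamp (wraps on overflow)."""
    return round(seconds * CLOCK_HZ) % (1 << STAMP_BITS)

def ticks_to_seconds(ticks: int) -> float:
    """Convert a time stamp back to seconds."""
    return ticks / CLOCK_HZ

def ticks_per_picture(picture_rate_hz: float) -> float:
    """Number of clock ticks between two consecutive pictures."""
    return CLOCK_HZ / picture_rate_hz

# Time after which a 33-bit counter running at 90 kHz wraps around.
wrap_seconds = (1 << STAMP_BITS) / CLOCK_HZ

print(VIDEO_CLOCK_HZ // CLOCK_HZ)      # ratio between the two clocks
print(ticks_per_picture(25))           # ticks per picture at 25 Hz
print(ticks_per_picture(30))           # ticks per picture at 30 Hz
print(round(wrap_seconds / 3600, 1))   # hours before the counter wraps
```

A 33-bit counter at 90 kHz wraps after a little more than 26 hours, which is why time stamps of this width are ample for stored programme material.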

The input video can be modeled as three 3D arrays of pixels, where the first two dimensions relate to the spatial visual information and the third one corresponds to time. MPEG-1 Video coding can be defined as a function that takes three 3D arrays of pixels as input and produces a bitstream. Unlike standards such as H.261, however, MPEG-1 does not have any pre-specified value for these 3D arrays of pixels. In particular, it says nothing about the size of the picture, which can be any value up to the maximum size of 4,096x4,096 pixels, and also says nothing about the time spacing between two consecutive 2D arrays of pixels, which can assume any value from slightly more than 1/24 s to 1/60 s. The only major - and deliberate - constraint is that the spatial position of pixels in consecutive pictures be the same. In other words MPEG-1 Video can only handle "progressive", i.e. not interlaced, pictures.

Keeping the described flexibility in terms of number of pixels per line, lines per picture and pictures per second is the right thing to do when writing a standard that is conceived as an abstract Signal Processing (SP) function operating on the three 3D arrays of pixels and producing a bitstream (and the opposite function at the decoder). Obviously it is not the right thing to do when a company makes products. If the decoder must be capable of decoding any picture size, at the bitrate chosen by the encoder, the decoder must be overdesigned to such an extent that its cost would easily put the product out of the market. 

This is the reason why MPEG-1 Video specifies a set of parameters, called Constrained Parameter Set (CPS), given by the table below, that correspond to a "reasonable" set of choices for the market needs of the time the standard was produced. 

MPEG-1 Video Constrained Parameter Set 

Parameter                     Value    Units
Horizontal size               768      Pixels
Vertical size                 576      Lines
No. of macroblocks/picture    396
No. of macroblocks/second     9900
Picture rate                  30       Hz
Interpolated pictures
Bitrate                       1,856    kbit/s

The maximum number of horizontal pixels - 768 - is a commonly used value for computer displays and the maximum number of scanning lines is the number of active lines in a PAL frame. The maximum number of macroblocks/picture corresponds to a "PAL" SIF picture (288/16x352/16), which has roughly a quarter of the pixels of a full PAL frame. The corresponding value for "NTSC" SIF is 330 (240/16x352/16), and the maximum number of macroblocks/second is the same for both PAL and NTSC (396x25 and 330x30, respectively). The maximum bitrate is the net bitrate in a 2,048 kbit/s primary multiplex when one time slot has been used for, say, audio (29x64) in addition to TS0 and TS16, which are not available for payload. This is the only place where a number somehow related to the 1.5 Mbit/s of the title is used.  
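The figures quoted in the paragraph above can be reproduced with simple arithmetic. The sketch below assumes 16x16-pixel macroblocks (implied by the divisions by 16 in the text) and a primary multiplex of 32 time slots of 64 kbit/s each; the variable and function names are illustrative only.

```python
MB_SIZE = 16  # a macroblock covers 16x16 luminance pixels

def macroblocks_per_picture(width: int, height: int) -> int:
    """Number of macroblocks needed to tile a width x height picture."""
    return (width // MB_SIZE) * (height // MB_SIZE)

pal_sif = macroblocks_per_picture(352, 288)    # "PAL" SIF
ntsc_sif = macroblocks_per_picture(352, 240)   # "NTSC" SIF

pal_mb_per_second = pal_sif * 25     # 25 pictures per second
ntsc_mb_per_second = ntsc_sif * 30   # 30 pictures per second

# 2,048 kbit/s primary multiplex: 32 time slots of 64 kbit/s each.
TIME_SLOTS = 32
SLOT_KBIT_S = 64
reserved_slots = 2   # TS0 and TS16 are not available for payload
audio_slots = 1      # one slot set aside for, say, audio
video_kbit_s = (TIME_SLOTS - reserved_slots - audio_slots) * SLOT_KBIT_S

print(pal_sif, ntsc_sif)                      # 396 330
print(pal_mb_per_second, ntsc_mb_per_second)  # 9900 9900
print(video_kbit_s)                           # 1856
```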

This table highlights the fact that an implementation of an MPEG-1 Video decoder is constrained by the size of the RAM (a multiple of 288x352), the number of memory accesses per second (a multiple of 288x352x25), the bitrate at the input of the decoder and the number of pictures that can be interpolated. Apart from that, it is absolutely irrelevant whether the pictures are from an NTSC or PAL source. 

As mentioned above, MPEG-1 Video is basically an outgrowth of H.261 with some significant technical differences, in addition to the flexibility of video formats, which were limited to CIF or ¼ CIF in H.261. The first is the introduction of an "intraframe mode" (called I-pictures) that can be used to insert programmed entry points where information does not depend on the past, as it would in a series of predictive pictures (called P-pictures). This is one major requirement coming from the flagship application of "storage and retrieval on Digital Storage Media". The second is the addition of frame interpolation (called B-pictures) to frame prediction. This feature had been considered in the development of H.261 but discarded because of the coding delay it created for real-time communication. For "storage and retrieval on DSM", instead, the short additional delay caused by B-pictures is largely compensated for by the considerable picture quality improvement. It is also clear that the more pictures are interpolated, the more memory is needed and for this reason the CPS allows only up to 2 B-pictures. 
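The reordering implied by B-pictures can be illustrated with a short sketch. A B-picture is interpolated from the anchor (I- or P-picture) before it and the anchor after it, so both anchors have to reach the decoder first. The function below is a hypothetical helper, not part of the standard: it takes picture types in display order and returns them in the order in which they would have to be coded.

```python
def coded_order(display_types: str) -> list[tuple[int, str]]:
    """Reorder pictures from display order to coding order.

    display_types is a string such as "IBBPBBP". Each returned item is
    (display_index, picture_type). B-pictures are held back until the
    anchor that follows them in display order has been emitted.
    """
    ordered = []
    pending_b = []
    for index, ptype in enumerate(display_types):
        if ptype == "B":
            pending_b.append((index, ptype))
        else:
            ordered.append((index, ptype))   # anchor goes out first
            ordered.extend(pending_b)        # then the B-pictures that needed it
            pending_b = []
    # Trailing B-pictures would wait for the next anchor in a real stream;
    # this sketch simply appends them.
    ordered.extend(pending_b)
    return ordered

print(coded_order("IBBPBBP"))
# [(0, 'I'), (3, 'P'), (1, 'B'), (2, 'B'), (6, 'P'), (4, 'B'), (5, 'B')]
```

The difference between the position of a picture in the coded stream and its display index is what makes separate decoding and presentation times necessary.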

The figure below represents the hierarchical nature of MPEG-1 Video data whose elements are Group of Pictures (GOP), enclosed between two I-pictures, Pictures, Slices, Macroblocks (made of 4 Blocks) and Blocks (made of 8x8 pixels). 

 

Fig. 2 - Hierarchy of MPEG-1 Video

 

Slices are another departure from the Group of Blocks (GOB) structure of H.261. Three of the other technical changes are the increase in the motion estimation accuracy to ½ pixel, the removal of the loop filter, and different types of quantisation. 

The coding algorithm processes the 3D array of pixels corresponding to luminance. Pixels in one (x,y) plane at time (t) are organised in 8x8 blocks. If the picture is of type "I", the DCT linear transformation is applied to all blocks and the resulting DCT coefficients are VLC-coded. If the picture is of type "P", an algorithm tries to find the best match between a given macroblock at time (t+1) and one macroblock in the picture at time (t). For each macroblock, motion vectors are differentially encoded with respect to the immediately preceding macroblock. The corresponding block at time (t), displaced by the amount indicated by the motion vector, is subtracted from each block at time (t+1). The DCT linear transformation is then applied to the difference block. Motion vectors and DCT coefficients are VLC-coded with various tricks to reduce the number of bits required to code these data. If the picture is of type "B", each block is interpolated using the available anchors. Different picture types, variable length coding and, obviously, the different amount of motion in different parts of a sequence, make the overall data rate variable. 
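The standard does not say how an encoder should search for the best match, so the following is only one possible sketch of the P-picture steps described above: a full search minimising the sum of absolute differences (SAD), the difference block, and differential coding of motion vectors. All function names, the 8x8 block size used to keep the demo small, the +/-4 pixel search range and the synthetic test picture are assumptions made for illustration.

```python
import random

def sad(ref, cur, rx, ry, cx, cy, n):
    """Sum of absolute differences between an n x n block of cur at (cx, cy)
    and an n x n block of ref at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def best_motion_vector(ref, cur, cx, cy, n=8, search=4):
    """Full search over +/-search pixels; returns ((dx, dy), cost)."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(ref, cur, cx, cy, cx, cy, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block falls outside the reference picture
            cost = sad(ref, cur, rx, ry, cx, cy, n)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost

def residual(ref, cur, cx, cy, mv, n=8):
    """Difference block: current block minus the displaced reference block."""
    dx, dy = mv
    return [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
             for i in range(n)] for j in range(n)]

def differential(vectors):
    """Code each vector as the difference from the previous one."""
    previous, out = (0, 0), []
    for vx, vy in vectors:
        out.append((vx - previous[0], vy - previous[1]))
        previous = (vx, vy)
    return out

# Synthetic demo: the current picture is the reference picture shifted
# 2 pixels to the right and 1 pixel down (with wrap-around at the edges).
SIZE = 32
rng = random.Random(0)
ref = [[rng.randrange(256) for _ in range(SIZE)] for _ in range(SIZE)]
cur = [[ref[(y - 1) % SIZE][(x - 2) % SIZE] for x in range(SIZE)]
       for y in range(SIZE)]

mv, cost = best_motion_vector(ref, cur, 12, 12)
print(mv, cost)  # the block is found 2 pixels left and 1 pixel up in ref
print(differential([(-2, -1), (-2, -1), (-1, 0)]))
```

In a real encoder the difference block would then go through the DCT, quantisation and variable length coding; those stages are left out here.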

If the channel has a fixed rate, a first-in-first-out (FIFO) buffer may be used to adapt the encoder output to the channel. The encoder will monitor the status of this buffer to control the number of bits generated by the encoder. Changing quantisation parameters is the most direct way of controlling the bitrate. MPEG-1 Video specifies an abstract model of the buffering system, called Video Buffering Verifier (VBV), in order to constrain the maximum variability in the number of bits that are used for a given picture. This ensures that a bitstream can be decoded with a buffer of known size. 
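The idea behind such a buffer model can be sketched as a simple simulation. This is a deliberately simplified illustration and not the normative VBV: it assumes a constant-rate channel, removes each picture from the buffer instantaneously once per picture period, and uses made-up picture sizes, buffer size and initial fullness.

```python
def simulate_decoder_buffer(picture_bits, bitrate, picture_rate,
                            buffer_size, initial_fullness):
    """Track decoder buffer fullness, picture by picture.

    Raises ValueError if a picture is not fully in the buffer when it is
    due to be decoded (underflow) or if the buffer overfills (overflow).
    Returns the fullness after each picture period.
    """
    bits_in_per_period = bitrate / picture_rate
    fullness = initial_fullness
    trace = []
    for number, bits in enumerate(picture_bits):
        if bits > fullness:
            raise ValueError(f"underflow at picture {number}")
        fullness -= bits                 # picture removed for decoding
        fullness += bits_in_per_period   # channel keeps delivering bits
        if fullness > buffer_size:
            raise ValueError(f"overflow at picture {number}")
        trace.append(fullness)
    return trace

# Hypothetical picture sizes in bits for one group of 12 pictures.
SIZES = {"I": 150_000, "P": 60_000, "B": 20_000}
pictures = [SIZES[t] for t in "IBBPBBPBBPBB"]

trace = simulate_decoder_buffer(pictures, bitrate=1_150_000, picture_rate=25,
                                buffer_size=400_000, initial_fullness=200_000)
print(trace[0], trace[-1])
```

An encoder that wanted to respect the model would run the same bookkeeping while coding and coarsen or refine the quantisation whenever the simulated fullness approached either limit.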

The algorithm adopted for MPEG-1 Audio Layer I and II is a typical subband-coding algorithm, as represented in the figure below.

 

Fig. 3 - Model of MPEG-1 Audio encoder

 

PCM audio samples are fed into a bank of polyphase filters with 32 subbands. The filter bank decomposes the input signal into subsampled spectral components. In the case of a Layer III encoder, a Modified DCT transform is added to increase the frequency resolution, which is 18 times higher than in Layer II. Therefore the filtered or "mapped" samples are called subband samples in Layer I and II, and DCT-transformed subband samples in Layer III. A psychoacoustic model is used to calculate an estimate of the masking threshold, i.e. the noise level that is just below the perception threshold, and this is used to control the quantisation and coding block. An intelligent encoder will allocate the available number of bits per block so that the quantisation noise is kept below the masking threshold. The allocation strategy and the psychoacoustic model are not specified by the standard and the latter is actually a very strong differentiator between encoders from different manufacturers. The standard only provides a very basic informative description of one psychoacoustic model. 
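Since the allocation strategy is left to the encoder, the sketch below shows just one hypothetical approach: a greedy loop that keeps giving one more bit to the subband whose quantisation noise exceeds the masking threshold by the largest margin. It assumes the common rule of thumb that each additional quantiser bit lowers the noise by about 6 dB; the function name, the signal-to-mask values and the bit budget are invented for the example.

```python
DB_PER_BIT = 6.02  # assumed noise reduction per additional quantiser bit

def allocate_bits(signal_to_mask_db, bit_budget, max_bits=15):
    """Greedy bit allocation across subbands.

    signal_to_mask_db[i] is how far (in dB) the signal in subband i lies
    above the masking threshold. With b bits, the quantisation noise is
    taken to lie signal_to_mask_db[i] - DB_PER_BIT * b above the threshold.
    """
    bits = [0] * len(signal_to_mask_db)
    for _ in range(bit_budget):
        noise_to_mask = [smr - DB_PER_BIT * b
                         for smr, b in zip(signal_to_mask_db, bits)]
        audible = [i for i in range(len(bits))
                   if bits[i] < max_bits and noise_to_mask[i] > 0]
        if not audible:
            break  # all noise is already below the masking threshold
        worst = max(audible, key=lambda i: noise_to_mask[i])
        bits[worst] += 1
    return bits

# Four subbands only, to keep the example readable.
print(allocate_bits([20.0, 5.0, -3.0, 12.0], bit_budget=6))    # [3, 1, 0, 2]
print(allocate_bits([20.0, 5.0, -3.0, 12.0], bit_budget=100))  # [4, 1, 0, 2]
```

The third subband never receives any bits because its signal is already below the masking threshold, which is exactly the kind of saving a psychoacoustic model makes possible.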

The "bitstream formatting" block assembles the actual bitstream from the output data of the other blocks, and adds other information (e.g. error correction) if necessary. The resulting data are then packed in a fixed-length packet of data, using a bitstream structure that keeps separate the critical parts that must be transmitted with high reliability. Four different modes are possible. The first two are single channel, and dual channel, in which two independent audio signals are coded within one bitstream. The other two are stereo, in which the left and right signals of a stereo pair are coded within one bitstream, and Joint Stereo, in which the left and right signals of a stereo pair are coded within one bitstream with the stereo irrelevancy and redundancy exploited. Layer III has a number of features that enable better performance than the lower two layers: entropy coding to further reduce redundancy, a buffer to smooth out large variations in the number of output bits, and more advanced joint-stereo coding methods. 

 

Fig. 4 - Model of MPEG-1 Audio decoder

 

At the decoder, bitstream data are read from a delivery medium-specific decoder. The bitstream data are unpacked to recover the different pieces of information, and the bitstream unpacking block also performs error detection if error-checking was applied in the encoder. The reconstruction block reconstructs the quantised version of the set of mapped samples. The inverse mapping transforms these mapped samples back into PCM samples.