Riding the Media Bits

Last update: 2011/08/21

Riding the media bits

 

 

New round of video compression

 

More video compression has come after MPEG-1 Video, MPEG-2 Video and MPEG-4 Visual.


During the development of MPEG-4, several liaison statements were sent to ITU-T suggesting to work together on the new MPEG-4 Visual standard and even on the MPEG-4 Audio standard, specifically on the speech coding part. As no responses were received to these offers MPEG continued the development of MPEG-4 alone.

For several years the Video Coding Experts Group (VCEG) of Study Group 16 of ITU-T worked from the ground up on the development of new video compression technologies and achieved a breakthrough in compression performance around the turn of the century. In spite of the lack of official answers to our liaison statements from ITU-T, I decided that it would be in the industry interest to establish a working relationship. When Thomas Sikora left MPEG I appointed Gary Sullivan, VCEG rapporteur, as MPEG Video chair to manage to achieve a belated convergence of the ITU-T and MPEG efforts in the area of video coding.

At the July 2001 meeting MPEG reviewed the results of video compression viewing tests designed to assess whether there was evidence for advances in video coding technology that warranted the start of a new video coding project. With the positive result of the review a Call for Proposals was issued and in December the Joint Video Team (JVT) composed of MPEG and VCEG members was established.

With an intense schedule of meetings, the JVT managed to achieve the Final Draft International Standard stage of the new Advanced Video Coding (AVC) standard in March 2003. So AVC became part 10 of MPEG-4.

The AVC standard specifies

  1. A video coding layer (VCL) for efficient representation of the video content
  2. A network abstraction layer (NAL) to format the VCL representation and provide header information (Fig. 1).

The most important control data are

  1. The sequence parameter set (SPS) that applies to an entire series of coded pictures
  2. The picture parameter set (PPS) that applies to one or more individual pictures within such a series.

Fig. 1 - The AVC layers

AVC is based on the so-called "block based hybrid video coding" approach where a coded video sequence consists of an independently-coded sequence of coded pictures.

  1. The VCL data for each picture is encoded with a reference to its PPS header data, and the PPS data contains a reference to the SPS header data
  2. In addition to motion-compensated inter-picture prediction a more sophisticated intra-picture prediction exploiting dependencies between spatially-neighbouring blocks within the same picture is used
  3. Spatial block transform coding exploiting the remaining spatial statistical dependencies within the prediction residual signal for a block region (both for inter- and intra-picture prediction) is used
  4. Motion compensation of the luma component block uses by quarter-sample accuracy and high-quality interpolation filters
  5. Transformation (integer approximation of DCT, hence no drift between encoder and decoder picture representations) is defined for block size 4x4 or 8x8
  6. Macroblocks of size 16X16 and their further subdivisions down to 4X4 are the basic building blocks of the encoding process e.g., for motion compensation (MC) and linear transformation
  7. Adaptive de-blocking filter is used in the prediction loop
  8. It is possible to use references for prediction of any macroblock from one of up to F previously decoded pictures
  9. A picture may be split into one or several slices, i.e. sequences of macroblocks which are typically processed in the order of a raster scan
  10. Definition of B-type, P-type, and I-type pictures is made slice-wise
  11. Two different entropy coding mechanisms are used
    1. Context-Adaptive VLC (CAVLC)
    2. Context-Adaptive Binary Arithmetic Coding (CABAC)
  12. The hypothetical reference decoder (HRD) specifies how/when bits are fed to a decoder and how decoded pictures are removed from a decoder
  13. In additions to compressed video data there is a need to make available to a decoder Supplemental Enhancement Information (SEI)

Fig. 2 shows the innovation broght about by point 8. in the list above

Fig. 2 - Multiple reference frames in AVC

Fig. 3 shows the innovation broght about by point 6. in the list above

Fig. 3 - Variable macroblock size in AVC

Following requests from the industry several profiles have been defined, some of which are

  • Baseline, Main, and Extended Profiles primarily focused on "entertainment-quality" video, based on 8-bits/sample, and 4:2:0 chroma sampling
  • Full range extensions (FRExt) for applications such as content-contribution, content-distribution, studio editing and post-processing
  • Professional quality (PQ) extensions for quality ranges beyond 10 bit/sample and 4:4:4 color sampling

The AVC endeavour kept its promise of reducing by half the performance of MPEG-2 Video.

At the same meeting the JVT was established, MPEG started an investigation in video scalability that eventually led to the development of requirements for Scalable Video Coding (SVC). In a nutshell these imply that the encoded video stream should be structured into a base layer stream, decodable by a non-scalable decoder and one or more enhancement layer stream(s) decodable by a decoder conforming to the SVC standard. A Call for Proposals was issued and this work item, too, was entrusted to the JVT even though the SVC standard can actually work on top of any base layer video coding standard. 

SVC is based on a layered representation with multiple dependencies. To achieve temporal scalability there is a need for frame hierarchies so that frames that are not used as references for prediction of layers that are still present can be skipped, as indicated in Fig. 2 where pictures marked as “B3” can be removed to reduce the frame rate by a factor of 3, and by removing those marked “B2” the frame rate is reduced by a factor of 2 etc.

Fig. 4 - SVC frame hierarchy

SVC offers a high degree of flexibility in terms of scalability dimensions, e.g. it supports various temporal/spatial resolutions, Signal-to-Noise (SNR)/fidelity levels and global/local Region of Interest (ROI) access). SVC performs significantly better and is much more flexible in terms of number of layers and combination of scalable modes than the scalable version of MPEG-2 Video and MPEG-4 Visual, while the penalty in compression performance, as compared to single-layer coding, is almost negligible,

For the purpose of spatial scalability, the video is first downsampled to the required spatial resolution(s). The ratio between frame heights/widths of the respective resolutions does not need to be dyadic (factor of two). Encoding as well as decoding starts at the lowest resolution, where an AVC compatible “base layer” bitstream will typically be used. For the respective next-higher “enhancement layer”, three decoded component types are used for inter-layer prediction from the lower layer:

  • Up-sampled intra-coded macroblocks;
  • Motion and mode information (aligned/stretched according to image size ratios);
  • Up-sampled residual signal in case of inter-coded macroblocks.

Fig. 4 - SVC block diagram

Seminal work on Multiview Video Coding was already carried out for MPEG-2 and more work was done for MPEG-4 Visual. In AVC Multiview Video Coding was added to AVC. ) standard that provides efficient coding of such multiview video. The overall structure of MVC defining the interfaces is illustrated in Fig. 5.

Fig. 5 - MVC model

The encoder receives N temporally synchronized video streams and generates one bitstream. The decoder receives the bitstream, decodes and outputs N Video signals that can be used for different purposes: to generate 1 view, N views of a stereo view.

Prediction across views , as shown in Fig. 6 is used to exploit inter-camera redundancy with the limitation that inter-view prediction is only effected from the same time instance and cannot exceed the maximum number of stored reference pictures


Fig. 5 - MVC prediction

The base view is independent of any other view and is AVC compatible that can be extracted to provide a compatible 2D representation of the 3D version.