Last update: 2011/08/21
An overview of the technical content of other MPEG-4 components.
MPEG-4 Visual provides a coding algorithm for natural video that is capable of operating from 5 kbit/s with a spatial resolution of QCIF (144x176 pixels) scaling up to bitrates of some Mbit/s for ITU-R 601 resolution pictures (288x720@50Hz and 240x720@59.94 Hz). The Studio Profile brings the operation range in excess of 1 Gbit/s. It is ITU-T H.263 compatible in the sense that a basic H.263 bitstream is correctly decoded by an MPEG-4 Video decoder.
As mentioned before, MPEG-4 Video supports conventional rectangular images and video (upper portion of Fig. 1 below) as well as images and video of arbitrary shape (lower portion of figure).
Fig. 1 - The MPEG-4 Video Core and the Generic MPEG-4 Coder
The coding of conventional images and video is similar to conventional MPEG-1/2 coding. It involves motion prediction/compensation followed by texture coding. For content-based functionalities, where the image sequence input may be of arbitrary shape and location, shape and transparency information is encoded as well. Shape may be represented either by an 8-bit transparency component, which allows the description of transparency if one Video Object (VO) is composed with other objects, or by a binary mask.
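The two shape representations can be illustrated with a small composition sketch. This is not the normative MPEG-4 composition process; the function names and the rounding rule are assumptions made for illustration. It only shows how an 8-bit transparency value blends a VO sample with the sample of another object, and how a binary mask is the special case in which a sample is either fully transparent or fully opaque.

```python
def composite_pixel(fg, bg, alpha):
    """Blend one VO sample (fg) over another object's sample (bg).

    alpha is an 8-bit transparency value: 0 = fully transparent,
    255 = fully opaque. The +127 rounds to the nearest integer
    (an illustrative choice, not taken from the standard).
    """
    return (alpha * fg + (255 - alpha) * bg + 127) // 255


def composite(fg_rows, bg_rows, alpha_rows):
    """Compose a whole VO over a background, sample by sample."""
    return [
        [composite_pixel(f, b, a) for f, b, a in zip(fr, br, ar)]
        for fr, br, ar in zip(fg_rows, bg_rows, alpha_rows)
    ]


def binary_mask_to_alpha(mask_rows):
    """A binary mask only says 'inside' or 'outside' the shape."""
    return [[255 if inside else 0 for inside in row] for row in mask_rows]
```

With a binary mask, `composite` simply copies the VO sample inside the shape and keeps the background sample outside it; with the 8-bit component, intermediate values give partially transparent objects.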
The basic coding structure is represented in the figure below. This involves shape coding (for arbitrarily shaped VOs) and motion compensation as well as DCT-based texture coding (using standard 8x8 DCT or shape adaptive DCT).
Fig. 2 - The MPEG-4 Video coding scheme
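As an illustration of the texture coding stage, the following is a minimal sketch of a forward and inverse 8x8 DCT written directly from the textbook DCT-II definition. It is deliberately slow and unoptimised, it does not cover the shape adaptive DCT, and the function names are chosen here for illustration only.

```python
import math

N = 8  # block size of the standard 8x8 DCT


def _scale(k):
    # Orthonormal scaling: the DC basis function gets a smaller weight.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def _basis(n, k):
    # Cosine basis function of frequency k evaluated at sample position n.
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_2d(block):
    """Forward 2-D DCT of an 8x8 block of samples."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += block[x][y] * _basis(x, u) * _basis(y, v)
            out[u][v] = _scale(u) * _scale(v) * s
    return out


def idct_2d(coeffs):
    """Inverse 2-D DCT, recovering the 8x8 block of samples."""
    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            s = 0.0
            for u in range(N):
                for v in range(N):
                    s += (_scale(u) * _scale(v) * coeffs[u][v]
                          * _basis(x, u) * _basis(y, v))
            out[x][y] = s
    return out
```

For a flat block all the energy ends up in the single DC coefficient, which is why smooth texture can be represented with very few non-zero coefficients after the transform.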
MPEG-4 Video can offer unexpectedly high compression ratios if it is possible to exploit a-priori knowledge of the scene. In Fig. 3, coding the top left picture would require a considerable amount of information but, if it is possible to separate the background and the sprite (top right), the picture at the bottom can be coded with relatively few bit/s.
Fig. 3 - Background and sprites in MPEG-4 Video
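The idea of separating background and foreground can be sketched as follows. This is a toy model under strong assumptions: the background is a larger still picture sent once, each frame only needs the position of the visible window plus the foreground object with its binary shape mask, and camera motion is a pure translation. The function names and the data are hypothetical and are not taken from the standard.

```python
def crop_window(sprite, top, left, height, width):
    """Cut the currently visible background window out of the sprite."""
    return [row[left:left + width] for row in sprite[top:top + height]]


def paste_object(background, obj, mask, top, left):
    """Copy the object samples selected by the binary mask onto a copy
    of the background; samples outside the shape are left untouched."""
    frame = [row[:] for row in background]
    for r, (obj_row, mask_row) in enumerate(zip(obj, mask)):
        for c, (sample, inside) in enumerate(zip(obj_row, mask_row)):
            if inside:
                frame[top + r][left + c] = sample
    return frame


# Toy data: a 4x6 background and a 1x2 object whose second sample
# lies outside the object's shape.
sprite = [[10 * r + c for c in range(6)] for r in range(4)]
window = crop_window(sprite, top=1, left=2, height=2, width=3)
frame = paste_object(window, obj=[[99, 98]], mask=[[1, 0]], top=0, left=1)
```

Once the background has been received, each further frame in this model costs only a window position and the (small) object, which is where the bit saving comes from.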
The ‘facial animation object’ can be used to render an animated face. The face object contains a generic face with a neutral expression. This can be rendered as such. The shape, texture and expressions of the face are controlled by Facial Definition Parameters (FDP) and/or Facial Animation Parameters (FAP).
Fig. 4 - Face Definition Parameters
Upon receiving the animation parameters from the bitstream, the face can be animated (expressions, speech, etc.). FDPs can be sent to change the appearance of the face from something generic to a particular face with its own shape and texture. If so desired, a complete face model can be downloaded via the FDP set. Face models themselves are not mandated by the standard. It is also possible to use specific configurations of the lips and the mood of the speaker.
The Body is an object capable of producing virtual body models and animations in the form of a set of 3D polygonal meshes ready for rendering. Two sets of parameters are defined for the body: the Body Definition Parameter (BDP) set and the Body Animation Parameter (BAP) set. The BDP set defines the parameters that transform the default body into a customised body with its own body surface, body dimensions, and (optionally) texture. The BAPs will produce reasonably similar high-level results in terms of body posture and animation on different body models.
MPEG-4 Audio provides complete coverage of the bitrate range of 2 to 64 kbit/s. Good-quality coded speech is already obtained at 2 kbit/s, and transparent quality of monophonic music sampled at 48 kHz and 16 bits/sample is obtained at 64 kbit/s. Three classes of algorithms are used in the standard. The first covers the low bitrate range and has been designed to encode speech. The second can be used in the midrange to encode both speech and music. The third covers the high bitrate range and can be used for any audio signal.
MPEG-4 Audio contains a large set of coding tools through which it is possible to construct several audio and speech coding algorithms.
MPEG-4 AAC is MPEG-2 AAC with the addition of one tool: Perceptual Noise Substitution (PNS). This tool identifies segments of spectral coefficients that appear to be noise-like and codes them as random noise. Instead of the coefficients themselves, the bitstream indicates that PNS is used and carries the value of the average power of the noise. A decoder uses a pseudo-random noise generator weighted by the signaled power value to reconstruct those coefficients.
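The decoder side of PNS can be sketched as follows. This is a simplified illustration and not the AAC decoding procedure: the function names, the uniform noise source, the fixed seed and the use of an unquantised power value are all assumptions. The point is only that the decoder regenerates a noise-like segment whose average power matches the signaled value, without ever receiving the original coefficients.

```python
import math
import random


def average_power(coeffs):
    """Mean of the squared coefficients in a segment (encoder-side measure)."""
    return sum(x * x for x in coeffs) / len(coeffs)


def pns_reconstruct(num_coeffs, signaled_power, seed=0):
    """Fill a segment with pseudo-random values scaled so that their
    average power equals the signaled power value."""
    rng = random.Random(seed)  # hypothetical generator choice
    noise = [rng.uniform(-1.0, 1.0) for _ in range(num_coeffs)]
    energy = sum(x * x for x in noise)
    if energy == 0.0:
        return [0.0] * num_coeffs
    scale = math.sqrt(signaled_power * num_coeffs / energy)
    return [x * scale for x in noise]


# The reconstructed values differ from whatever the encoder saw, but the
# segment carries the same average power.
segment = pns_reconstruct(num_coeffs=16, signaled_power=4.0)
```

Because a single power value replaces a whole segment of coefficients, the bit cost of noise-like parts of the spectrum becomes very small.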
In the area of synthetic audio two important technologies are available. The first is a Text To Speech (TTS) Interface (TTSI), i.e. a standard way to represent prosodic parameters, such as pitch contour, phoneme duration, and so on. Typically these can be used in a proprietary TTS system to improve the synthesised speech quality and to create, with the synthetic face, a complete audio-visual talking face. The TTS can also be synchronised with the facial expressions of an animated talking head as in the figure below.
Fig. 5 - TTS-driven Face Animation
The second technology provides a rich toolset for creating synthetic sounds and music, called Structured Audio (SA). Using newly developed formats to specify synthesis algorithms and their control, any current or future sound-synthesis technique can be used to create and process sound in MPEG-4. The sound quality is guaranteed to be exactly the same on every MPEG-4 decoder.