Inside MPEG-7

Before starting this page I must warn the reader that MPEG-7 is more abstract than previous MPEG standards, which may make this page more difficult to read than others. With this warning given, let’s start with some definitions that will hopefully facilitate understanding of the MPEG-7 standard.

| Element | Definition | Examples |
|---|---|---|
| Data | The audio-visual information to be described using MPEG-7 | MPEG-4 elementary streams, Audio CDs containing music, hard disks containing MP3 files, synthetically generated pictures, drawings on a piece of paper |
| Feature | A distinctive characteristic of a Data item that means something to somebody | Colour of a picture, particular rhythm of a piece of music, camera movement in a video, cast of a movie |
| Descriptor | A representation of a Feature. It defines the syntax and semantics of the representation of the Feature. Different Descriptors may very well represent the same Feature | The Feature “colour” can be represented as a histogram or as a frequency spectrum |
| Descriptor Value | An instantiation of a Descriptor for a given Data set. Descriptor Values are combined through the Description Scheme mechanism to form a Description | |
| Description Scheme | The specification of the structure and semantics of the relationships among its components. These can be Descriptors or, recursively, Description Schemes. The distinction between a DS and a D is that a D contains only basic data types and does not make reference to any other D (or, obviously, DS) | A movie that is temporally structured in scenes, with textual descriptions at scene level and some audio descriptors of dialogues and background music |

In the following, Ds and DSs are collectively called Description Tools (DT).
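
To make the D/DS distinction concrete, the following minimal Python sketch (the class names are illustrative, not part of the standard) models a Descriptor as binding a Feature to basic values only, while a Description Scheme recursively composes Ds and other DSs:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Descriptor:
    """A D binds a Feature to basic data types only (illustrative model)."""
    feature: str        # e.g. "colour"
    values: List[float] # a Descriptor Value: instantiation for a given Data set

@dataclass
class DescriptionScheme:
    """A DS specifies the structure and semantics of the relationships among
    its components, which are Ds or, recursively, other DSs."""
    name: str
    components: List[Union[Descriptor, "DescriptionScheme"]] = field(default_factory=list)

# The table's example: a movie temporally structured in scenes,
# with a colour D attached at scene level
scene1 = DescriptionScheme("Scene1", [Descriptor("colour", [0.2, 0.5, 0.3])])
movie = DescriptionScheme("Movie", [scene1])
print(movie)
```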

The figure below represents the main elements making up the MPEG-7 standard.

Figure 1 – The main MPEG-7 elements

MPEG-7 provides a wide range of low-level descriptors.

MPEG-7 Visual Tools consist of basic structures and Ds that cover the following basic visual features: Color, Texture, Shape, Motion and Localisation. 

The Color feature has multiple Ds. Some of them are: 

| Name | Description |
|---|---|
| Color Quantization | Expresses colour histograms while keeping the flexibility of linear and non-linear quantisation and look-up tables |
| Dominant Color(s) | Represents features where a small number of colours suffices to characterise the colour information in the region of interest |
| Scalable Color | Is useful for image-to-image matching and retrieval based on colour features (see the sketch after this table) |
| Color Structure | Captures both colour content (similar to a colour histogram) and information about the structure of this content. Its intended use is still-image retrieval, since its main functionality is image-to-image matching |
| Color Layout | Specifies the spatial distribution of colours. It can be used for image-to-image matching and video-clip-to-video-clip matching, or for layout-based colour retrieval such as sketch-to-image matching |
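
As a rough illustration of image-to-image matching with a colour-histogram-style descriptor, the Python sketch below quantises an RGB image into a coarse joint histogram and compares two histograms with an L1 distance. It is a toy stand-in, not the normative Scalable Color extraction:

```python
import numpy as np

def colour_histogram(image: np.ndarray, bins_per_channel: int = 4) -> np.ndarray:
    """Coarse RGB histogram: quantise each channel, count joint bins, normalise."""
    quantised = (image.astype(np.uint32) * bins_per_channel) // 256
    joint = (quantised[..., 0] * bins_per_channel + quantised[..., 1]) * bins_per_channel + quantised[..., 2]
    hist = np.bincount(joint.ravel(), minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

def histogram_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    """L1 distance between two normalised histograms (0 = identical)."""
    return float(np.abs(h1 - h2).sum())

# Two random "images"; a real retrieval system would rank a database by this distance
a = np.random.randint(0, 256, (64, 64, 3))
b = np.random.randint(0, 256, (64, 64, 3))
print(histogram_distance(colour_histogram(a), colour_histogram(b)))
```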

The Texture feature has 3 Ds. 

| Name | Description |
|---|---|
| Homogeneous Texture | Is used for searching and browsing through large collections of similar-looking patterns. An image can be considered as a mosaic of homogeneous textures, so that the texture features associated with the regions can be used to index the image data. Agricultural areas and vegetation patches are examples of homogeneous textures commonly found in aerial and satellite imagery |
| Texture Browsing | Provides a perceptual characterization of texture, similar to a human characterization, in terms of regularity, coarseness and directionality. The computation of this descriptor proceeds similarly to the Homogeneous Texture D. First, the image is filtered with a bank of special filters. From the filtered outputs, two dominant texture orientations are identified. Then the regularity and coarseness are determined by analysing the filtered image projections along the dominant orientations |
| Edge Histogram | Represents the spatial distribution of five types of edges: four directional edges and one non-directional edge. The edge histogram can retrieve images with similar semantic meaning, since edges play an important role in image perception (see the sketch after this table) |
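
The sketch below illustrates the idea behind the Edge Histogram D: classify small 2x2 pixel blocks into one of the five edge types with simple edge filters and histogram the winners. The filter coefficients and the threshold are illustrative, not the normative values:

```python
import numpy as np

# 2x2 filters for the five edge types (illustrative coefficients)
FILTERS = {
    "vertical":        np.array([[1, -1], [1, -1]]),
    "horizontal":      np.array([[1,  1], [-1, -1]]),
    "diagonal_45":     np.array([[np.sqrt(2), 0], [0, -np.sqrt(2)]]),
    "diagonal_135":    np.array([[0, np.sqrt(2)], [-np.sqrt(2), 0]]),
    "non_directional": np.array([[2, -2], [-2, 2]]),
}

def edge_histogram(gray: np.ndarray, threshold: float = 11.0) -> dict:
    """Count, per edge type, the 2x2 blocks whose strongest filter response
    exceeds a threshold; blocks below threshold are treated as edge-free."""
    counts = {name: 0 for name in FILTERS}
    h, w = gray.shape
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            block = gray[y:y + 2, x:x + 2].astype(float)
            responses = {n: abs((block * f).sum()) for n, f in FILTERS.items()}
            best = max(responses, key=responses.get)
            if responses[best] >= threshold:
                counts[best] += 1
    return counts

print(edge_histogram(np.random.randint(0, 256, (32, 32))))
```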

The Shape feature has 4 Ds. Region-Based Shape is a Descriptor capable of describing any shape. This is a complex task, because the shape of an object may consist of a single region or a set of regions, possibly with holes in the object or several disjoint regions. 

The Motion feature has 4 Ds: camera motion, object motion trajectory, parametric object motion, and motion activity. 

| Name | Description |
|---|---|
| Camera Motion | Characterises the motion parameters of a camera in 3-D space. This motion parameter information can be extracted automatically or generated by capture devices |
| Motion Trajectory | Describes the motion trajectory of an object, defined as the localisation, in time and space, of one representative point of this object. In surveillance, alarms can be triggered if an object has a trajectory identified as dangerous (e.g. passing through a forbidden area, being unusually quick, etc.); see the sketch after this table. In sports, specific actions (e.g. tennis rallies taking place at the net) can be recognised |
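
A minimal Python sketch of the Motion Trajectory idea: the trajectory is a list of (time, x, y) keypoints for one representative point, and the surveillance example reduces to checking interpolated positions against a forbidden area. The linear-interpolation model and all names are illustrative:

```python
from typing import List, Tuple

Keypoint = Tuple[float, float, float]   # (time_s, x, y)

def position_at(trajectory: List[Keypoint], t: float) -> Tuple[float, float]:
    """Linearly interpolate the representative point's position at time t."""
    for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return x0 + a * (x1 - x0), y0 + a * (y1 - y0)
    raise ValueError("t outside trajectory")

def enters_forbidden_area(trajectory, area, steps=100):
    """Trigger an alarm if any sampled position falls inside a rectangle
    (xmin, ymin, xmax, ymax) -- the 'forbidden area' of the surveillance example."""
    xmin, ymin, xmax, ymax = area
    t0, t1 = trajectory[0][0], trajectory[-1][0]
    for i in range(steps + 1):
        x, y = position_at(trajectory, t0 + (t1 - t0) * i / steps)
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return True
    return False

track = [(0.0, 0.0, 0.0), (1.0, 5.0, 5.0), (2.0, 10.0, 0.0)]
print(enters_forbidden_area(track, (4.0, 4.0, 6.0, 6.0)))   # True: passes through (5, 5)
```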

MPEG-7 Audio Tools comprise seventeen low-level temporal and spectral Ds that may be used in a variety of applications. While low-level audio Ds in general can serve many conceivable applications, the Spectral Flatness D specifically supports robust matching of audio signals. Applications include audio fingerprinting and identification of audio against a database of known works, and thus locating metadata for legacy audio content that carries no annotation. 
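
Spectral flatness itself is a standard signal-processing quantity: the ratio of the geometric mean to the arithmetic mean of the power spectrum, close to 1 for noise-like frames and close to 0 for tonal ones. The sketch below computes it per frame; the framing is an illustrative assumption, and the normative D operates on frequency bands rather than the whole spectrum:

```python
import numpy as np

def spectral_flatness(signal: np.ndarray, frame_size: int = 1024) -> np.ndarray:
    """Per-frame flatness = geometric mean / arithmetic mean of the power spectrum."""
    n_frames = len(signal) // frame_size
    flatness = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_size:(i + 1) * frame_size]
        power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12   # epsilon avoids log(0)
        flatness[i] = np.exp(np.mean(np.log(power))) / np.mean(power)
    return flatness

# Noise is flat (close to 1); a pure sinusoid is tonal (close to 0)
t = np.arange(48000) / 48000.0
print(spectral_flatness(np.random.randn(48000)).mean())
print(spectral_flatness(np.sin(2 * np.pi * 440 * t)).mean())
```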

Four sets of audio Description Tools – roughly representing application areas – are integrated in the standard: sound recognition, musical instrument timbre, spoken content, and melodic contour. Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.

| Name | Description |
|---|---|
| Musical Instrument Timbre | Describes perceptual features of instrument sounds. The aim is to capture these features with a reduced set of Ds relating to notions such as “attack”, “brightness” or “richness” of a sound (see the sketch after this table) |
| Sound Recognition | Indexes and categorises general sounds, with immediate application to sound effects |
| Spoken Content | Describes words spoken within an audio stream. It trades compactness for robustness of search, because current Automatic Speech Recognition (ASR) technologies have their limits, and one will always encounter out-of-vocabulary utterances. To accomplish this, the tools represent both the output and what would normally be regarded as intermediate ASR results. The tools can be used for two broad classes of retrieval scenarios: indexing into and retrieval of an audio stream, and indexing of multimedia objects annotated with speech |
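
The “attack” notion in the timbre table can be captured as a log-attack-time: the logarithm of the time the signal envelope needs to rise from a small fraction of its maximum to near the maximum. The thresholds (2% and 90%) in this sketch are illustrative assumptions, and the normative extraction differs in detail:

```python
import numpy as np

def log_attack_time(signal: np.ndarray, sample_rate: int,
                    start_frac: float = 0.02, stop_frac: float = 0.9) -> float:
    """log10 of the time for the amplitude envelope to rise from
    start_frac to stop_frac of its maximum (illustrative thresholds)."""
    envelope = np.abs(signal)
    peak = envelope.max()
    t_start = np.argmax(envelope >= start_frac * peak)  # first sample above 2% of peak
    t_stop = np.argmax(envelope >= stop_frac * peak)    # first sample above 90% of peak
    return float(np.log10((t_stop - t_start + 1) / sample_rate))

# A percussive onset (fast attack) vs a slowly bowed onset (slow attack)
sr = 48000
t = np.arange(sr) / sr
percussive = np.exp(-20 * t) * np.sin(2 * np.pi * 440 * t)
bowed = np.minimum(t / 0.3, 1.0) * np.sin(2 * np.pi * 440 * t)
print(log_attack_time(percussive, sr), log_attack_time(bowed, sr))
```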

One can easily see, from this cursory presentation, the range of tools being offered to content owners and to application developers. The MPEG-7 Ds have been designed for describing a wide range of information types: low-level audio-visual features such as color, texture, motion, audio energy, and so forth, as illustrated above; high-level features of semantic objects, events and abstract concepts; content management processes; information about the storage media; and so forth. It is expected that most Ds corresponding to low-level features will be extracted automatically, whereas human intervention is likely to be required for producing higher-level Ds. 

The MPEG-7 Multimedia Description Schemes part of the standard defines a set of DTs dealing with generic as well as multimedia entities. Generic entities are features which are used in audio, visual, and text descriptions, and are therefore “generic” to all media. These are, for instance, “vector”, “time”, etc. More complex DTs are also standardised. They are used whenever more than one medium needs to be described (e.g. audio and video). These DTs can be grouped into 6 different classes according to their functionality, as in the following figure.

Figure 2 – The MPEG-7 Multimedia Description Schemes 

| Elements | Definition |
|---|---|
| Basic Elements | Facilitate the creation and packaging of descriptions |
| Content Description | Represents perceivable information |
| Content Management | Information about the media features, the creation and the usage of the AV content |
| Content Organization | Represents the analysis and classification of several AV contents |
| Navigation and Access | Specifies summaries and variations of the AV content |
| User Interaction | Describes user preferences and usage history |

Basic Elements define a number of Schema Tools that facilitate the creation and packaging of MPEG-7 descriptions, as well as basic data types and mathematical structures, such as vectors and matrices, which are important for audio-visual content description. There are also constructs for linking media files and localising segments, regions, and so forth. Many of the basic elements address specific needs of audio-visual content description, such as the description of time, places, persons, individuals, groups, organisations, and other textual annotation. 
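
As an example of the role played by the time and segment-localisation constructs, the sketch below models a media time as a (start, duration) pair and tests whether two segments overlap. It mimics the function of MPEG-7's time datatypes without reproducing their normative syntax; all names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class MediaTime:
    """A segment locator: start point and duration, both in seconds.
    (MPEG-7 expresses these with dedicated time datatypes; this is a stand-in.)"""
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration

    def overlaps(self, other: "MediaTime") -> bool:
        return self.start < other.end and other.start < self.end

scene = MediaTime(start=60.0, duration=30.0)      # a scene from 1:00 to 1:30
dialogue = MediaTime(start=75.0, duration=10.0)   # a dialogue segment inside it
print(scene.overlaps(dialogue))                   # True
```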

Content Description describes the Structure (regions, video frames, and audio segments) and Semantics (objects, events, abstract notions). Structural aspects describe the audio-visual content from the viewpoint of its structure. Conceptual aspects describe the audio-visual content from the viewpoint of real-world semantics and conceptual notions. 

The Content Management DTs allow the description of the life cycle of the content, from creation to consumption, including media coding, storage and file formats, and content usage.

| Name | Description |
|---|---|
| Creation Information | Describes the creation and classification of the audio-visual content and of other material related to it: Title (which may itself be textual or another piece of audio-visual content), Textual Annotation, and Creation Information such as creators, creation locations and dates |
| Classification Information | Describes how the audio-visual material is classified into categories such as genre, subject, purpose, language, and so forth. It also provides review and guidance information such as age classification, subjective review, parental guidance, and so forth |
| Related Material Information | Describes whether other audio-visual material exists that is related to the content being described |
| Usage Information | Describes usage information related to the audio-visual content, such as usage rights (through links to the rights holders and other information related to rights management and protection), availability, usage record and financial information |
| Media Description | Describes the storage media, such as the compression, coding and storage format of the audio-visual data |

Content Organisation organises and models collections of audio-visual content and of descriptions. 

Navigation and Access facilitates browsing and retrieval of audio-visual content by defining summaries, partitions, decompositions, and variations of the audio-visual material. 

| Name | Description |
|---|---|
| Summaries | Provide compact summaries of the audio-visual content to enable discovery, browsing, navigation, visualisation and sonification of audio-visual content |
| Partitions and Decompositions | Describe different decompositions of the audio-visual signals in space, time and frequency |
| Variations | Provide information about different variations of audio-visual programs, such as summaries and abstracts; scaled, compressed and low-resolution versions; and versions with different languages and modalities – audio, video, image, text, and so forth |

User Interaction describes user preferences and usage history pertaining to the consumption of the multimedia material. This allows, for example, matching between user preferences and MPEG-7 content descriptions in order to facilitate personalization of audio-visual content access, presentation and consumption. 
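
A toy sketch of that matching step: content descriptions are scored against a user's weighted genre preferences and ranked. The data layout is invented for illustration; the standard defines much richer user-preference structures:

```python
def preference_score(description: dict, preferred_genres: dict) -> float:
    """Sum the user's weights for every genre the description carries."""
    return sum(preferred_genres.get(g, 0.0) for g in description["genres"])

preferences = {"documentary": 0.9, "sport": 0.4}          # genre -> preference weight
catalogue = [
    {"title": "Alpine Birds", "genres": ["documentary"]},
    {"title": "Cup Final",    "genres": ["sport"]},
    {"title": "Soap Ep. 12",  "genres": ["drama"]},
]

# Rank the catalogue for this user: documentary first, unrated drama last
for item in sorted(catalogue, key=lambda d: preference_score(d, preferences), reverse=True):
    print(item["title"], preference_score(item, preferences))
```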

The main tools used to implement MPEG-7 descriptions are DDL, DSs, and Ds. Ds bind a feature to a set of values. DSs are models of the multimedia objects and of the universes that they represent. They specify the types of the Ds that can be used in a given description, and the relationships between these Ds or between other DSs. The DDL provides the descriptive foundation by which users can create their own DSs and Ds and defines the syntactic rules to express and combine DSs and Ds. 

The Description Definition Language satisfies the requirement of being able to express spatial, temporal, structural, and conceptual relationships between the elements of a DS, and between DSs. It provides a rich model for links and references between one or more descriptions and the data they describe. The DDL parser is also capable of validating Description Schemes (content and structure) and D data types, both primitive (integer, text, date, time) and composite (histograms, enumerated types). MPEG-7 adopted the XML Schema Language as the DDL but added certain extensions in order to satisfy all requirements. The DDL can be broken down into the following logical normative components: the XML Schema structural language components, the XML Schema datatype language components, and the MPEG-7-specific extensions. 
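
Since the DDL is based on XML Schema, a description instance can, to a first approximation, be validated with an ordinary XML Schema processor. The Python sketch below uses the third-party xmlschema package; the file names are placeholders, and the MPEG-7-specific DDL extensions would not be covered by a plain XML Schema validator:

```python
import xmlschema   # third-party: pip install xmlschema

# Placeholder file names -- substitute the actual MPEG-7 schema and a description
schema = xmlschema.XMLSchema("mpeg7-schema.xsd")

if schema.is_valid("description.xml"):
    print("description is valid against the schema")
else:
    # iter_errors yields each validation problem with its location
    for error in schema.iter_errors("description.xml"):
        print(error)
```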

The information representation specified in the MPEG-7 standard provides the means to represent coded multimedia content description information. The entity that makes use of such coded representation of the multimedia content is an “MPEG-7 terminal”. This may be a standalone application or a part of an application system. The architecture of such a terminal is depicted in the figure below. 

Figure 3 – Model of an MPEG-7 terminal

The Delivery layer, placed at the bottom of the figure, provides MPEG-7 elementary streams to the Systems layer. MPEG-7 elementary streams consist of consecutive, individually accessible portions of data named Access Units. An Access Unit (AU) is the smallest data entity to which timing information can be attributed. MPEG-7 elementary streams contain Schema information, which defines the structure of the MPEG-7 description, and Description information. The latter can be either the complete description of the multimedia content or fragments of the description. 

MPEG-7 data can be represented in textual format, in binary format, or in a mixture of the two, depending on application requirements. A unique mapping between the binary format and the textual format is defined by the standard. A bi-directional lossless mapping between the textual representation and the binary representation is possible, but it need not always be used: some applications may not want to transmit all the information contained in the textual representation and may prefer a more bit-efficient, lossy binary transmission. The syntax of the textual format is defined by the DDL; the syntax of the binary format, called Binary format for MPEG-7 data (BiM), was originally defined in Part 1 (Systems) of the standard but was later moved to Part 1 of MPEG-B. 
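
To illustrate (not reproduce) this textual/binary duality, the toy encoder below maps a small description losslessly to bytes and back, while a lossy variant drops fields before encoding to save bits. BiM itself is schema-aware and far more sophisticated:

```python
import json
import zlib

def to_binary(description: dict) -> bytes:
    """Lossless textual -> binary mapping (here: compressed JSON)."""
    return zlib.compress(json.dumps(description, sort_keys=True).encode("utf-8"))

def to_textual(blob: bytes) -> dict:
    """The inverse mapping: binary -> textual."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

def to_binary_lossy(description: dict, keep: set) -> bytes:
    """A bit-efficient lossy variant: drop fields before encoding."""
    return to_binary({k: v for k, v in description.items() if k in keep})

desc = {"title": "Alpine Birds", "genre": "documentary", "annotation": "long free text ..."}
assert to_textual(to_binary(desc)) == desc   # bi-directional and lossless
print(len(to_binary(desc)), len(to_binary_lossy(desc, {"title", "genre"})))
```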

At the Compression layer, the flow of AUs (either textual or binary) is parsed and the content description is reconstructed. An MPEG-7 binary stream can either be parsed by the BiM parser and converted into textual format, which is then passed on for further reconstruction processing, or be parsed by the BiM parser and passed on in a proprietary format for further processing. 

AUs are further structured as commands encapsulating the schema or the description information. Commands allow a description to be delivered in a single chunk or fragmented into small pieces. They enable basic operations such as updating a D, deleting part of the description, or adding a new DDL structure. The reconstruction stage of the Compression layer updates the description information and the associated schema information by consuming these commands. Further structuring of the schema or description is out of the scope of the MPEG-7 standard in its current form.
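
The sketch below mimics this command mechanism: each AU carries commands (add, update, delete at a path) that the terminal consumes to keep its reconstructed description current. The command names and the path convention are invented for illustration; the normative command set is defined by the standard:

```python
def apply_commands(description: dict, access_unit: list) -> dict:
    """Consume one AU: a list of (command, path, payload) triples.
    Paths address nodes in the description tree with '/' separators."""
    for command, path, payload in access_unit:
        *parents, leaf = path.split("/")
        node = description
        for key in parents:
            node = node.setdefault(key, {})
        if command in ("add", "update"):
            node[leaf] = payload
        elif command == "delete":
            node.pop(leaf, None)
    return description

description = {}
au1 = [("add", "movie/title", "Alpine Birds"),
       ("add", "movie/scene1/annotation", "opening shot")]
au2 = [("update", "movie/title", "Alpine Birds (restored)"),
       ("delete", "movie/scene1/annotation", None)]

for au in (au1, au2):   # the reconstruction stage consumes AUs in order
    description = apply_commands(description, au)
print(description)      # {'movie': {'title': 'Alpine Birds (restored)', 'scene1': {}}}
```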