Machines Begin To Understand The World

The increasing dependence of many users on their devices, particularly mobile devices, has initiated an unstoppable drive – supported by the growing interaction and processing capabilities of those devices – to pack more functionalities into them. In some cases the functionality is entirely confined within the device, but in other cases the ability to interact with other devices or services depends on an agreed – i.e. standard – communication interface.

Users can already capture the audio of a song from the air, send its signature to a service and receive back all sorts of information regarding the song. This is something that can already be implemented in a standard fashion using the MPEG-7 Audio Descriptors.
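The flow of such a service can be sketched as follows. This is an illustrative toy only – the signature function below is a deliberately simplified stand-in, not the MPEG-7 Audio Descriptor extraction, and the service is modelled as a plain dictionary:

```python
# Hypothetical sketch of a fingerprint lookup service. The "signature"
# here is a toy hash of coarse energy levels; a real MPEG-7 Audio
# Descriptor is far more robust to noise and distortion.
import hashlib

def audio_signature(samples):
    """Toy signature: hash coarsely quantised window energies."""
    window = 4
    levels = []
    for i in range(0, len(samples) - window + 1, window):
        energy = sum(s * s for s in samples[i:i + window])
        levels.append(int(energy) % 8)  # coarse quantisation
    return hashlib.sha1(bytes(levels)).hexdigest()

# Service side: a database mapping signatures to song metadata
database = {audio_signature([1, 2, 3, 4, 5, 6, 7, 8]): "Song A"}

# Client side: capture audio, compute the signature, query the service
query = audio_signature([1, 2, 3, 4, 5, 6, 7, 8])
print(database.get(query, "unknown"))  # → Song A
```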

In the “image” domain MPEG has made progress compared to the “old” MPEG-7 Visual capabilities. This has been achieved with two amendments to MPEG-7 Visual: image signature tools and video signature tools. These descriptors provide a “fingerprint” that uniquely identifies image and video content. They are robust, in the sense that their value is not affected by a wide range of common editing operations. However, they are also sufficiently different for every item of “original” content to allow unique and reliable identification of the image or the video.

Image Signature is a content-based descriptor designed for the fast and robust identification of the same or a modified image at web scale or in databases. Also known as a fingerprint, Image Signature has a strong advantage over watermarking techniques in that it does not require any modification of the content and can be used readily with all existing content. Image Signature combines two complementary approaches in image representation:

  1. A global signature, where the signature is extracted from the entire image
  2. A local approach, where a set of local signatures are extracted at salient points in the image.
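The two approaches can be illustrated with a much-simplified sketch. The functions below are assumptions for illustration only – the actual MPEG-7 extraction procedure is specified in the standard; here a "global signature" is one bit per pixel relative to the image mean, and "local signatures" are small bit patterns around given salient indices (salient-point detection itself is assumed, not implemented). Signatures are compared by Hamming distance:

```python
# Toy model of global vs. local signatures; not the MPEG-7 algorithm.

def global_signature(pixels):
    """Global: one bit per pixel, set if brighter than the image mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def local_signatures(pixels, salient_points):
    """Local: a small bit pattern around each (assumed) salient index."""
    sigs = []
    for idx in salient_points:
        patch = pixels[max(0, idx - 1): idx + 2]
        mean = sum(patch) / len(patch)
        sigs.append([1 if p > mean else 0 for p in patch])
    return sigs

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return sum(x != y for x, y in zip(a, b))

img = [10, 200, 30, 220, 15, 210, 25, 205]
edited = [12, 198, 33, 218, 14, 212, 22, 207]  # mild "colour change"
print(hamming(global_signature(img), global_signature(edited)))  # → 0
```

Note how the mild edit leaves the toy global signature unchanged: robustness comes from thresholding against a statistic (the mean) that the edit barely moves.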

Image Signature has been tested against a wide range of common modifications, such as text/logo overlay, rotation, cropping, colour changes, etc., achieving an overall success rate of ~99.29% at a false alarm rate of less than 0.05 parts per million for the global signature, and ~98.04% at a false alarm rate of less than 10 parts per million for the complete signature. Search speed is in the order of 80 million matches per second for the global signature and 100,000 matches per second for the complete signature. The Image Signature is also extremely compact, as it requires only 1024 bits per image for the global signature and up to 7424 bits for the complete signature.
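Such matching speeds are plausible because comparing two compact bit signatures reduces to a bitwise XOR plus a population count. A minimal sketch (not the standard's matching procedure – the 10% decision threshold below is an assumption for illustration), using 1024-bit signatures packed as Python integers:

```python
# Why compact bit signatures match fast: XOR + popcount per comparison.
import random

BITS = 1024
random.seed(0)

def rand_sig():
    """Random 1024-bit signature, standing in for unrelated content."""
    return random.getrandbits(BITS)

def hamming(a, b):
    """Bit differences between two packed signatures."""
    return bin(a ^ b).count("1")  # or (a ^ b).bit_count() on Python 3.10+

query = rand_sig()
db = [rand_sig() for _ in range(1000)] + [query]  # plant one true copy

# Accept a match when fewer than 10% of the bits differ; unrelated random
# signatures differ in ~50% of bits, so false alarms are vanishingly rare.
matches = [i for i, s in enumerate(db) if hamming(query, s) < BITS // 10]
print(matches)  # → [1000], only the planted copy
```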

Video Signature offers similar features to Image Signature. Its key technical aspects are a combined dense (video-frame-level) and sparse (video-segment-level) description approach, allowing flexible multi-stage matching schemes, and a custom descriptor compression scheme to facilitate efficient storage and transmission of the Video Signature metadata. Video Signature has been tested against a wide range of commonly performed modifications, e.g. text/logo overlay, camera capture (camcording), compression at low bitrates, resolution reduction, frame rate changes, etc., achieving an overall success rate of ~95.49% at a false alarm rate of less than 5 parts per million. Video Signature allows for very high extraction and matching speeds, and very low storage and transmission requirements, at only ~2 MB per hour of video content.
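The multi-stage idea – a cheap sparse (segment-level) comparison pruning candidates before a finer dense (frame-level) check – can be sketched as follows. All functions and thresholds here are illustrative assumptions, not the standard's actual signatures or decision rules:

```python
# Hedged sketch of coarse-to-fine video signature matching.

def frame_sig(frame):
    """Toy dense (frame-level) signature: mean luminance in 4 bits."""
    return int(sum(frame) / len(frame)) // 16

def segment_sig(frames):
    """Toy sparse (segment-level) signature: one coarse number."""
    sigs = [frame_sig(f) for f in frames]
    return sum(sigs) // len(sigs)

def match(query_frames, candidate_frames, tol=1):
    # Stage 1: cheap segment-level comparison prunes most candidates.
    if segment_sig(query_frames) != segment_sig(candidate_frames):
        return False
    # Stage 2: finer frame-by-frame check on the survivors.
    return all(abs(frame_sig(q) - frame_sig(c)) <= tol
               for q, c in zip(query_frames, candidate_frames))

clip = [[100, 110], [120, 130], [140, 150]]
recompressed = [[101, 109], [119, 131], [141, 149]]  # mild distortion
print(match(clip, recompressed))  # → True
```

The design point is that stage 1 touches only one coarse value per segment, so the vast majority of non-matching candidates are rejected before any frame-level work is done.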

Another important example is provided by “Visual Search”, which is usefully represented by the following use case. A user who wants to know more about an object takes a picture of it and sends it to a service. The service interprets the object, uses this interpretation to search a knowledge base and possibly responds with a number of additional information elements from a variety of viewpoints.

MPEG-7 already provides a number of descriptors for video, audio and multimedia that are useful for this task. However, more is needed to solve the challenging problem of enabling users to obtain the desired information by using their mobile and non-mobile devices to take a picture, extract sophisticated descriptors from the picture without depleting the battery, and send the data to the service provider of choice without clogging the network. This is the task of the Compact Descriptors for Visual Search standard (MPEG-7 Part 13).
