
Compressed Digital Is Better

A new technology to store 650 MByte of data on a CD had been developed, based on the same basic principles used by the telco industry for its optical fibres, but implemented in a way that brought the price down to levels the telco industry could only dream of. Storing that impressive amount of data on a 12-cm physical carrier was a done deal, but transmitting the same data over a telephone line would take a long time, because the telephone network had been designed to carry telephone signals with a nominal bandwidth of just 4 kHz, even without considering the instability of analogue networks.

So, the existing telephone network was unsuitable for the future digital world, but what should the evolutionary path to a network handling digital signals be? Two schools of thought formed: the visionaries and the realists. The first school aimed at replacing the existing network with a brand new optical network, capable of carrying hundreds – maybe more, they thought at the time – of Mbit/s per fibre. This was technically possible and had already been demonstrated in the laboratories, but the issue was bringing the cost of that technology down to acceptable levels, as the CE industry had done with the CD. The approach of the second school was based on more sophisticated considerations (which does not imply that the technology of optical fibres was not sophisticated): if the signals converted to digital form are indeed bandwidth-limited, as they have to be if the Nyquist theorem is to be applied, contiguous samples are unlikely to be radically different from one another. If the correlation between contiguous samples can be exploited, it should be possible to reduce the number of bits needed to represent the signals without affecting their quality too much.

Both approaches required investments in technology, but of a very different type. The former required investing in basic material technology, whose development costs had to be borne by suppliers and whose deployment costs by the telcos (actually, because of the limited propensity of manufacturers to invest on their own, telcos would practically have had to fund that research as well). The latter required investing in algorithms to reduce the bitrate of digital Audio and Video (AV) signals and in devices capable of performing what was expected to be a very high number of computations per second. In the latter case it could be expected that the investment cost would be shared with other industries. Of course either strategy, if lucidly implemented, could have led the telco industry to world domination, given the growing and assured flow of revenues that the companies providing the telephone service enjoyed at that time. But you can hardly expect such a vision from an industry accustomed to cosseting Public Authorities to preserve its monopoly. These two schools of thought have dominated the strategy landscape of the telco industry, the former having more the ear of the infrastructure portion of the telcos and the latter more the ear of the service portion.

The first practical application of the second school of thought was the facsimile. The archaic Group 1 and Group 2 analogue facsimile machines required 6 and 3 minutes, respectively, to transmit an A4 page. Group 3 was digital and designed to transmit an A4 page scanned with a density of 200 Dots Per Inch (DPI), with a standard number of 1,728 samples per scanning line and about 2,300 scanning lines if the same scanning density is used vertically (most facsimile terminals, however, are set to operate at a vertical scanning density of ½ the horizontal one).

Therefore, if 1 bit is used to represent the intensity (black or white) of a dot, about 4 Mbit are required to represent an A4 page. It would take some 400 seconds to transmit this amount of data using a 9.6 kbit/s modem (a typical value at the time Group 3 facsimile was introduced), but this time can be reduced by a factor of about 8 (the exact reduction factor depends on the specific content of the page) using a simple but effective scheme, sketched in code after the list below, based on the use of:

  1. “Run lengths” of black and white samples: these are coded instead of the individual sample values.
  2. Different code words to represent black and white run lengths: their statistical distribution is not uniform (e.g. black run lengths are more likely to be short than white ones). This technique is called Variable Length Coding (VLC).
  3. Different VLC tables for black and white run lengths: black and white samples on a piece of paper have different statistics.
  4. Information from the previous scan line: if there is a black dot on a line, it is more likely that there will be a black dot on the same vertical line one scanning line below than if there had been a white dot.
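
To make the idea concrete, here is a minimal sketch in Python of run-length plus variable-length coding of one scan line. The code tables are invented for illustration (the real ITU-T T.4 tables are far larger), and the greedy splitting of long runs only loosely mimics the standard's handling of long runs.

    # Minimal sketch of run-length + variable-length coding of one scan line.
    # The code tables below are illustrative only, not the actual T.4 tables.

    def run_lengths(line):
        """Turn a line of 0/1 pixels (0 = white, 1 = black) into (colour, length) runs."""
        runs, current, length = [], line[0], 1
        for pixel in line[1:]:
            if pixel == current:
                length += 1
            else:
                runs.append((current, length))
                current, length = pixel, 1
        runs.append((current, length))
        return runs

    # Toy VLC tables: the more likely run lengths get the shorter codes.
    WHITE_VLC = {1: '111', 2: '110', 4: '101', 8: '100', 16: '01', 64: '00'}
    BLACK_VLC = {1: '11', 2: '10', 4: '011', 8: '010', 16: '001', 64: '000'}

    def encode_line(line):
        bits = ''
        for colour, length in run_lengths(line):
            table = BLACK_VLC if colour == 1 else WHITE_VLC
            while length > 0:                      # emit the largest code not exceeding the run
                step = max(k for k in table if k <= length)
                bits += table[step]
                length -= step
        return bits

    # A mostly white line with a short black run compresses very well:
    line = [0] * 100 + [1] * 8 + [0] * 1620       # 1,728 pixels, as in Group 3
    print(len(line), 'pixels ->', len(encode_line(line)), 'bits')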

Group 3 facsimile was the first great example of successful use of digital technologies in the telco terminal equipment market.

But let’s go back to speech, the telco industry’s bread and butter. In the early years, 64 kbit/s for a digitised telephone channel was a very high bitrate in the local access line, but not so much in the long distance, where bandwidth, even without optical fibres, was already plentiful. Therefore, if speech was ever to reach the telephone subscriber in digital form, a lower bitrate had to be achieved.

The so-called Differential Pulse Code Modulation (DPCM) was widely considered in the 1970s and ’80s because the technique was simple, even though it offered only moderate compression. DPCM was based on the consideration that, since the signal has a limited bandwidth, consecutive samples will not be very different from one another, and sending the VLC of the difference between the two samples will statistically require fewer bits. To avoid error accumulation, however, instead of subtracting the previous sample from the current sample, an estimate of the previous sample (as reconstructed by the decoder) is subtracted, as shown in the figure below. Indeed, it is possible to make quite accurate estimates because speech has some well-identified statistical characteristics given by the structure of the human speech production mechanism. Taking into account the sensitivity of the human ear, it is convenient to quantise the difference finely if the difference value is small and more coarsely if the value is larger, before applying VLC coding. The decoder is very simple: the output is obtained by adding each dequantised difference to the prediction obtained by filtering the previously reconstructed samples through the predictor.


Figure 1 – DPCM encoder and decoder
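
To illustrate the principle of Figure 1, here is a minimal DPCM sketch in Python, with the simplest possible predictor (the previous reconstructed sample) and a uniform quantiser with step Q; it is a toy of my own making, not the adaptive quantisation described above nor any standardised codec.

    # Minimal DPCM sketch: predictor = previous reconstructed sample,
    # uniform quantiser with step Q. Illustration of the principle only.

    Q = 4  # quantisation step for the prediction difference

    def dpcm_encode(samples):
        indices, prediction = [], 0
        for s in samples:
            diff = s - prediction                  # predict from the reconstructed past
            index = round(diff / Q)                # quantise the difference
            indices.append(index)
            prediction = prediction + index * Q    # track the decoder's reconstruction
        return indices

    def dpcm_decode(indices):
        samples, prediction = [], 0
        for index in indices:
            prediction = prediction + index * Q    # running sum of dequantised differences
            samples.append(prediction)
        return samples

    original = [0, 3, 9, 18, 25, 27, 26, 20, 12, 5]
    decoded = dpcm_decode(dpcm_encode(original))
    print(list(zip(original, decoded)))            # reconstruction error stays around Q/2 or less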

DPCM was a candidate for video compression as well, but the typical compression ratio of 8:3 deemed feasible with DPCM was generally considered inadequate for the high bitrate of digital video signals. Indeed, using DPCM one could hope to reduce the 216 Mbit/s of a digital television signal down to about 70 Mbit/s, a magic number for European telcos because it was about 2 times 34 Mbit/s, an element of the European digital transmission hierarchy. For some time this bitrate was fashionable in the European telecom domain because it combined the two strategic approaches: DPCM running at a clock speed of 13.5 MHz was a technology within reach in the early 1980s, and 70 Mbit/s was “sufficiently high” to still justify the deployment of optical fibres to subscribers.

To picture the atmosphere of those years, let me recount the case when Basilio Catania, then Director General of CSELT and himself a passionate promoter of optical networks, opened a management meeting by stating that, because 210 Mbit/s was soon going to be feasible as subscriber access (he was being optimistic: it is becoming feasible only now, some 45 years later), and because 3 television programs were what a normal family would need, video had to be coded at 70 Mbit/s. The response to my naïve question whether the solution just presented was the doctor’s prescription was that I was no longer invited to the follow-up meetings. Needless to say, the project of bringing 3 TV channels to subscribers went nowhere.

Another method, used in music coding, subdivided the signal bandwidth into a number of sub-bands, each of which is quantised with more or less accuracy depending on the sensitivity of the ear to that particular frequency band. For obvious reasons this coding method was called Sub-band Coding (SBC).

Yet another method used the properties of certain linear transformations. A block of N samples can be represented as a point in an N-dimensional space and a linear transformation can be seen as a rotation of axes. In principle each sample can take any value within the selected quantisation range, but because samples are correlated, these will tend to cluster around a hyper-ellipsoid in the N-dimensional space. If the axes are rotated, i.e. an “orthogonal” linear transformation is applied, each block of samples will be represented by different numbers, called “transform coefficients”. In general, the first coordinate will have a large variance, because it corresponds to the axis of the ellipsoid that is most elongated, while the subsequent coordinates will tend to have lesser variance.


Figure 2 – Axis rotation in linear transformation

If the higher-variance transform coefficients (the u axis in the figure) are represented with higher accuracy and the lower-variance coefficients with lower accuracy, or even discarded, a considerable bit saving can be achieved without affecting the values of the samples too much when they are approximately reproduced, from the available information, by applying the inverse transformation.
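
As a toy illustration of this principle (using NumPy and the DCT mentioned further below as one possible orthogonal transform), the sketch below transforms a block of correlated samples, keeps only the first few coefficients, which carry most of the energy, and still reconstructs the block reasonably well.

    import numpy as np

    N = 8

    # Orthonormal DCT-II matrix: one possible "rotation of axes".
    n = np.arange(N)
    C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0, :] *= 1 / np.sqrt(2)
    C *= np.sqrt(2 / N)

    block = np.array([100, 102, 105, 109, 112, 113, 113, 112], dtype=float)  # correlated samples

    coeffs = C @ block            # transform: the energy concentrates in the first coefficients
    kept = coeffs.copy()
    kept[3:] = 0                  # discard the low-variance coefficients
    reconstructed = C.T @ kept    # inverse transform (C is orthogonal, so its inverse is its transpose)

    print(np.round(coeffs, 1))
    print(np.round(reconstructed, 1), 'vs', block)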

The other advantage of transform coding, besides compression, was the finer control over the number of bits used, compared with DPCM. If the number of samples is N and one needs a variable bitrate scheme between, say, 2 and 3 bits/sample, one can flexibly assign the bit payload between 2N and 3N bits per block.

A major shortcoming of this method was that, in the selected example, one needs about N×N multiplications and additions. The large number of operations was compensated by a rather simple add-multiply logic. “Fast algorithms” were invented that needed a smaller number of multiplications (about N×log2(N)), but required a considerable amount of logic driving the computations. An additional concern was the delay intrinsic in transform coding. Even with a smart organisation of the computations, this delay roughly corresponds to the time it takes to build the block of samples. If the signal is, say, music sampled at 48 kHz, and N=1,024, the delay is about 20 ms: definitely not a desirable feature for real-time communication, of some concern in the case of real-time broadcasting and of virtually no concern for playback from a storage device.
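
The operation counts and the blocking delay mentioned above can be put into numbers with a quick back-of-the-envelope sketch using the same example figures.

    import math

    N = 1024                       # block length for transform coding of music
    sampling_rate = 48_000         # Hz

    direct_multiplications = N * N             # about a million multiplications per block
    fast_multiplications = N * math.log2(N)    # about ten thousand with a fast algorithm
    block_delay_ms = 1000 * N / sampling_rate  # about 21 ms just to collect the block

    print(int(direct_multiplications), int(fast_multiplications), round(block_delay_ms, 1))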

The analysis above applies particularly to a one-dimensional signal like audio. Pictures, however, presented a different challenge. In principle the same algorithms as used for the one-dimensional (1D) audio signal could be applied to pictures. However, transform coding of long blocks of 1D picture samples (called pixels, from picture element) was not the way to go, because image signals have a correlation that dies out rather quickly, unlike audio signals, which are largely oscillatory in nature and whose frequency spectrum can therefore be analysed using a large number of samples. Applying linear transformations to 2D blocks of samples was very effective, but this required storing at least 8 or 16 scanning lines (the typical block size choices) of 720 samples each (the standard number of samples of a standard television signal), a very costly requirement in the early times (second half of the 1970s), when the arrival of the 64 kbit Random Access Memory (RAM) chip after the 16 kbit RAM chip, although much delayed, was hailed as a great technology advancement (today we easily have 128 GByte RAM chips).

Eventually, the Discrete Cosine Transform (DCT) became the linear transformation of choice for both still and moving pictures. It is therefore a source of pride for me that my group at CSELT was one of the first to investigate the potential of linear transforms for picture coding, and maybe the first to do so in Europe in a non-episodic fashion. In 1979 my group implemented one of the first real-time still-picture transmission systems that used the DCT and exploited the flexibility of transform coding by allowing the user to choose between transmitting more pictures per unit of time at a lower quality or fewer pictures at a higher quality.

Pictures offered another dimension compared to audio. Correlation within a picture (intra-frame) was important, but much more could be expected by exploiting inter-picture (inter-frame) correlation. One of the first algorithms considered was called Conditional Replenishment coding. In this system a frame memory contains the previously coded frame. The samples of the current frame are compared line by line, using an appropriate algorithm, with those of the preceding frame. Only the samples considered to be “sufficiently different” from the corresponding samples of the previously coded frame are compressed using intra-frame DPCM and placed in a transmission buffer. Depending on the degree of buffer fullness (“many/few changes, many/few bits produced”), the “change detection” threshold can be raised or lowered to produce fewer or more bits.
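
A minimal sketch of the Conditional Replenishment idea (Python/NumPy, with a hypothetical change threshold and a single scan line standing in for a frame; a sketch of the principle, not the actual COST 211 algorithm) could look like this:

    import numpy as np

    def conditional_replenishment(current, previous, threshold):
        """Return positions and values judged 'sufficiently different' from the previous
        frame; only these would be coded and sent. Sketch of the principle only."""
        changed = np.abs(current - previous) > threshold
        return np.flatnonzero(changed), current[changed]

    # Toy one-line 'frames': mostly unchanged background, a small moving region.
    previous = np.full(720, 100.0)
    current = previous.copy()
    current[300:320] += 40        # a moving object changes 20 samples

    threshold = 10.0              # would be raised when the transmission buffer fills up
    positions, values = conditional_replenishment(current, previous, threshold)
    print(len(positions), 'of', len(current), 'samples need to be re-sent')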

A rather sophisticated algorithm was the so-called Motion Compensation (MC) video coding. In its block-based implementation (say, 8×8 samples), the encoder looks for the best match between a given block of samples of the current frame and a block of samples of the preceding (encoded) frame. For practical reasons the search is performed only within a window of reduced size. Then the differences between the given block and the “motion-compensated” block of samples are encoded, again using a linear transformation. From this explanation it is clear that inter-frame coding requires the storage of a full frame, i.e. 810 Kbyte for a digital television picture of 625 lines. When the cost of RAM decreased sufficiently, it became possible to promote compressed digital video as a concrete proposition for video distribution.
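
The block-matching search at the heart of motion compensation can be sketched as follows (Python/NumPy, exhaustive search within a small window, sum of absolute differences as the matching criterion; a sketch of the principle, not of any particular standard).

    import numpy as np

    def best_match(current_block, previous_frame, top, left, search=7):
        """Find the displacement (dy, dx) within +/- search pixels that minimises the
        sum of absolute differences between the block and the previous frame."""
        h, w = current_block.shape
        best = (0, 0, np.inf)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + h > previous_frame.shape[0] or x + w > previous_frame.shape[1]:
                    continue
                candidate = previous_frame[y:y + h, x:x + w]
                sad = np.sum(np.abs(current_block - candidate))
                if sad < best[2]:
                    best = (dy, dx, sad)
        return best

    # Toy example: the current frame is the previous frame shifted by (2, 3) pixels.
    rng = np.random.default_rng(0)
    previous = rng.integers(0, 256, (64, 64)).astype(float)
    current = np.roll(previous, shift=(2, 3), axis=(0, 1))

    print(best_match(current[16:24, 16:24], previous, top=16, left=16))  # -> (-2, -3, 0.0)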

From the 1960s a considerable amount of research was carried out on compression of different types of signals: speech, facsimile, videoconference, music, television. The papers presented at conferences or published in academic journals can be counted by the hundreds of thousands and the filed patents by the tens of thousands. Several conferences developed to cater to this ever-increasing compression coding research community. At the international level, the International Conference on Acoustics, Speech and Signal Processing (ICASSP), the Picture Coding Symposium (PCS) and the more recent International Conference on Image Processing (ICIP) can be mentioned. Many more conferences exist today on special topics or at the regional/national level. Several academic journals dealing with coding of audio and video also exist. In 1988 I started “Image Communication”, a journal of the European Signal Processing Association (EURASIP). During my tenure as Editor-in-Chief, which lasted until 1999, more than 1,000 papers were submitted for review to that journal alone.

I would like to add a few words about the role that CSELT and my lab in particular had in this space. Already in 1975, my group had received the funds to build a simulation system that had A/D and D/A converters for black and white and PAL television, a solid state memory of 1 Mbyte built with 4 kbit RAM chips, connected to a 16-bit minicomputer equipped with a Magnetic Tape Unit to move the digital video data to and from the CSELT mainframe where simulation programs could be run.

In the late 1970s the Ampex digital video tape recorder promised to offer the possibility to store an hour of digital video. This turned out not to be possible because the input/output of that video tape recorder was still analogue. With the availability of the 16 kbit RAM chips we could build a new system called Digital Image Processing System (DIPS) that boasted a PDP 11/60 with 256 Kbyte RAM interfaced to the Ampex digital video tape recorder and a 16 Mbyte RAM for real-time digital video input/output.


Figure 3 – The DIPS Image Processing System

But things had started to move fast and my lab succeeded in building a new simulation facility called LISA by securing the latest hardware: a VAX 780 interfaced to a system made of two Ampex disk drives capable of real-time input/output of digital video according to ITU-R Recommendation 656 and, later, a D1 digital tape recorder.


Figure 4 – The LISA Image Processing System


The First Digital Wailings

ITU-T has promulgated a number of standards for digital transmission, starting from the foundational standard for digital representation of telephone speech at 64 kbit/s. This is actually a two-edged cornerstone because the standard specifies two different methods to digitise speech: A-law and µ-law. Both use the same sampling frequency (8 kHz) and the same number of bits/sample (8) but two different non-linear quantisation characteristics that take into account the logarithmic sensitivity of the ear to audio intensity. Broadly speaking, µ-law used to be the method of choice in North America and Japan and A-law in the rest of the world. 
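
As an illustration of non-linear quantisation, the sketch below uses the continuous µ-law companding curve (not the exact segmented characteristic standardised in ITU-T G.711) to show how small amplitudes are represented more finely than large ones.

    import numpy as np

    MU = 255.0

    def mu_law_compress(x):
        """Continuous mu-law companding of samples normalised to [-1, 1]."""
        return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

    def mu_law_expand(y):
        return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

    samples = np.array([0.001, 0.01, 0.1, 0.5, 1.0])
    compressed = mu_law_compress(samples)
    quantised = np.round(compressed * 127) / 127   # a uniform quantiser applied to the companded value
    restored = mu_law_expand(quantised)
    print(np.round(restored, 4))                   # small samples keep proportionally fine resolution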

While any telephone subscriber could technically communicate with any other telephone subscriber in the world, there were differences in how the communication was actually set up. If the two subscribers wishing to communicate belonged to the same local switch, they were directly connected via that switch. If they belonged to different switches, they were connected through a long-distance link where a number of telephone channels were “multiplexed” together. The size of the multiplexer depended on the likelihood that a subscriber belonging to switch A wanted to connect to a subscriber belonging to switch B at the same time. This pattern was then repeated through a hierarchy of switches. Today, with the general use of the Internet Protocols, things are different but the basic idea is the same.

When telephony was analogue, multiplexers were implemented using a Frequency Division Multiplexing technique where each telephone channel was assigned a 4 kHz slice. The hierarchical architecture of the network did not change with the introduction of digital techniques. The difference was in the technique used: Time instead of Frequency Division Multiplexing. With this technique, a given period of time was assigned to the transmission of the 8 bits of a speech sample of a telephone channel, followed by a sample of the next telephone channel, etc. Having made different choices at the starting point, Europe and the USA (with Japan also asserting its difference with yet another multiplexing hierarchy) kept on doing so with the selection of different transmission multiplexers: the primary US multiplexer had 24 speech channels – also called Time Slots – plus 8 kbit/s of other data, for a total bitrate of 1,544 kbit/s. The primary European multiplexer had 30 speech channels plus two non-speech channels, for a total bitrate of 2,048 kbit/s.

While both forms of multiplexing did the job they were expected to do, the structure of the 1,544 kbit/s multiplex was a bit clumsy, with 8 kbit/s inserted in an ad hoc fashion into the 1,536 kbit/s of the 24 Time Slots. The structure of the 2,048 kbit/s multiplex is cleaner, because the zero-th Time Slot (TS 0) carries synchronisation, the 16th TS (TS 16) is used for network signalling purposes and the remaining 30 TSs carry the speech channels. The American transmission hierarchy is 1.5/6/45 Mbit/s and the European hierarchy is 2/8/34/140 Mbit/s.
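
The frame arithmetic behind these numbers can be checked in a few lines (a sketch of the arithmetic only).

    FRAME_RATE = 8000          # frames per second: one 8-bit sample per channel per frame
    BITS_PER_SLOT = 8

    # European primary multiplex: 32 time slots (30 speech + TS 0 sync + TS 16 signalling)
    european = 32 * BITS_PER_SLOT * FRAME_RATE
    # American primary multiplex: 24 time slots plus one extra framing bit per frame (8 kbit/s)
    american = (24 * BITS_PER_SLOT + 1) * FRAME_RATE

    print(european, american)  # 2048000 and 1544000 bit/s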

The decision to bifurcate was deliberately taken by the European PTT Administrations, each having strong links with national manufacturers (actually two of them, as was the policy at that time, so as to retain a form of competition in procurement), as they feared that, by adopting a worldwide standard for digital speech and multiplexing, their industries would succumb to the more advanced American industry. These different communication standards notwithstanding, the ability of end users to communicate was not affected, because speech was digitised only for the purpose of core network transmission, while end-user devices continued to receive and transmit analogue speech.

The development of Group 3 facsimile was driven by the more enlightened approach of providing an End-To-End (E2E) interoperable solution. This is not surprising, because the “service-oriented” portions of the telcos and the related manufacturers, many of which were not typical telco manufacturers, were the driving force for this standard that sought to open up new services. The system is scarcely used today, but hundreds of millions of devices were sold.

The dispersed world of television standards found its unity again – sort of – when the CCIR approved Recommendations 601 and 656 related to digital television. Almost unexpectedly, agreement was found on a universal sampling frequency of 13.5 MHz for luminance and 6.75 MHz for the colour difference signals. By specifying a single sampling frequency, NTSC and PAL/SECAM could be represented by a bitstream with an integer number of samples per second, per frame and per line. The number of active samples per line is 720 for Y (luminance) and 360 for U and V (colour difference signals) for a television system with a 4/3 aspect ratio (width/height). This format was also called 4:2:2, where the three numbers represent the ratio of the Y:U:V (luminance and two chrominance signals) sampling frequencies. This sort of reunification, however, was achieved only in the studio, where Recommendation 656 was largely confined because 216 Mbit/s was such a high bitrate. Recommendations 601 and 656 are linked to Recommendation 657, the digital video recorder standard known as D1. While this format was not very successful market-wise, it played a fundamental role in the picture coding community because it was the first device available on the market that enabled storage, exchange and display of digital video compression results without the degradation introduced by analogue storage devices.

ITU-T also promulgated several standards for compressed speech and video. One of them was the 1.5/2 Mbit/s videoconference standard of the “H.100 series”, important because it was the first international standard for a digital audio-visual communication terminal. This transmission system was originally developed by a European collaborative project called COST 211 (COST stands for Collaboration Scientifique et Technique and 211 stands for area 2 – telecommunication, project no. 11) in which I represented Italy. 

It was a remarkable achievement because the project designed a complete terminal capable of transmitting audio, video, facsimile and other data, including End-to-End signalling. The video coding, very primitive if seen with today’s eyes, was a full implementation of a DPCM-based Conditional Replenishment scheme. My group at CSELT developed one of the four prototypes, the other three being those of British Telecom, Deutsche Telekom and France Telecom (these three companies had different names at that time, because they were government-run monopolies). The prototypes were tested for interoperability using 2 Mbit/s links over communication satellites, another first. The CSELT prototype was an impressive pair of 6U racks made of Medium Scale Integration and Large Scale Integration circuits that even contained a specially-designed Very Large Scale Integration (VLSI) circuit for change detection that used the “advanced” – for the late 1970s – 4 µm geometry! With today’s geometries smaller by some 3 orders of magnitude, these results look risible, but we are talking of some 45 years ago.

Videoconferencing, international by definition because it served the needs of business relations trying to replace long-distance travel with a less time- and energy-consuming alternative, was soon confronted with the need to deal with the incompatibilities of television standards and transmission rates that nationally-oriented policies had built over the decades. Since COST 211 was a European project, the original development assumed that video was PAL and transmission rate was 2,048 kbit/s. The former assumption had an important impact on the design of the codec, starting from the size of the frame memory.

By the time the work was being completed, however, it was belatedly “realised” that the world had different television standards and different transmission systems. A solution was required, unless every videoconference room in the world was to be equipped with cameras and monitors of the same standard – PAL – clearly wishful thinking. Unification of video standards is something that had been achieved – for safety reasons – in the aerospace domain, where all television equipment is NTSC, but it was not something that would ever happen in the fragmented business world of telecommunication terminals, and certainly not in the USA, where even importing non-NTSC equipment was illegal at that time.

A solution was found by first making a television “standard conversion” from NTSC to PAL in the coding equipment, then using the PAL-based digital compression system as originally developed by COST 211, and outputting the signal in PAL, if that was the standard used at the receiving end, or making one more standard conversion at the output, if the receiving end was NTSC. The extension to operate at 1.5 Mbit/s was easier to achieve, because the transmission buffer just tended to fill up more quickly than with a 2 Mbit/s channel, while the internal logic remained unaltered.

At the end of the project, the COST 211 group realised – belatedly – that there were better-performing algorithms – linear transformation with motion compensation – than the relatively simple but underperforming DPCM-based Conditional Replenishment scheme used by COST 211. This was quite a gratification for a person who had worked on transform coding soon after being hired and had always shunned DPCM as an intra-frame video compression technology without a future in mass communication. Instead of doing the work as a European project and then trying to “sell” the results to the international ITU-T environment – not an easy task, as had been discovered with H.100 – the decision was made to do the work in a competitive/collaborative environment similar to COST 211, but within the international arena as a CCITT “Specialists Group”. This was the beginning of the development work that eventually gave rise to ITU-T Recommendation H.261 for video coding at px64 kbit/s (p=1,…, 30). An extension of COST 211, called COST 211 bis, continued to play a “coordination” role for the European participation in the Specialists Group.

The first problem that the Chairman Sakae Okubo, then with NTT, had to resolve was again the difference in television standards. The group decided to adopt an extension of the COST 211 approach, i.e. conversion of the input video signal to a new “digital television standard”, with the number of lines of PAL and the number of frames/second of NTSC – following the commendable principle of “burden sharing” between the two main television formats. The resulting signal, digitised and suitably subsampled to 288 lines of 352 samples at a frame rate of 29.97 Hz, would then undergo the digital compression process that was later specified by H.261. The number 352 was derived from 360, ½ of the 720 pixels of digital television, but with a reduced number of pixels per line because it had to be divisible by 16, a number required by the block-coding algorithm of H.261.

I was part of that decision, made at the CSELT meeting in 1984. At that time, I had already become hyper-sensitive to the differences in television standards. I thought that the troubles created by the divisive approach followed by our forefathers had taught a clear enough lesson and that the world did not need yet another, no matter how well-intentioned, television standard, even if it was confined inside the digital machine that encoded and decoded video signals. I was left alone and the committee went on to approve what was called the “Common Intermediate Format”. This regrettable decision set aside, H.261 remains the first example of truly international collaboration in the development of a technically very complex video compression standard.


Figure 1 – The Common Intermediate Format

The “Okubo group”, as it was soon called, made a thorough investigation of the best video coding technologies and assembled a reasonably performing video coding system for bitrates, like 64 kbit/s, that were once considered unattainable (even though some prototypes based on proprietary solutions had already been shown before). The H.261 algorithm can be considered the progenitor of most video coding algorithms commonly in use today, even though the equipment built against this standard was not particularly successful in the marketplace. The reason is that the electronics required to perform the complex calculations made the terminals bulky and expensive. No one dared make the large investments needed to manufacture integrated circuits that would have reduced the size of the terminal and hence the cost (the rule of thumb at that time was that the cost of electronics is proportional to its volume). The high cost made the videoconference terminal a device centrally administered in companies, thus discouraging impulse users. Moreover, the video quality when using H.261 at about 100 kbit/s – the bitrate remaining when using 2 Integrated Services Digital Network (ISDN) time slots after subtracting the bitrate required by audio (when this was actually compressed) and other ancillary signals – was far from satisfactory for business users, while consumers had already been scared away by the price of the device.


Digital Technologies Come Of Age

Speech digitisation was driven by the need to manage long-distance transmission systems more effectively. But the result of this drive also affected end users because, thanks to digital technologies, the random level of speech quality that had plagued telephony since the day it was invented began to have fewer consequences on the perceived communication quality. Later the CCITT adopted the Group 3 facsimile standard, which offered considerable improvements to end users, business first and general consumers later. Then Philips and Sony on the one hand, and RCA on the other, began to put on the market the first equipment that carried bits with the meaning of music to consumers’ homes.

As in the case of speech digitisation, digital technologies in the Consumer Electronics space were not primarily intended to offer end users something really new. The CD was just another carrier: more compact, lighter to distribute, and with a quality claimed to be indistinguishable from the studio’s. The opinion of some consumers, however, seems to indicate that CD quality is not necessarily the issue, because some claim that the sound of the Long Playing (LP) record is better than that of the CD. In other words, the drivers of both digital speech and digital music were the stability and the reduced manufacturing and maintenance costs offered by digital technology, not quality. This last feature was just a by-product.

In the same year as the CD (1982), the CCITT published its Recommendation H.100, which enabled videoconferencing through the use of 1.5/2 Mbit/s primary multiplexers as carriers of compressed videoconference streams. Videoconferencing was not unknown at that time, because several telecommunication operators, and broadcasting companies as well, had run videoconference trials – all using analogue techniques – but there was hope that H.100 would eventually enable widespread use of a form of business communication that at that time was little more than a curiosity. This was followed in the mid 1980s by the beginning of the standardisation activity that would give rise to CCITT Recommendations H.261 (video coding at px64 kbit/s) and H.221 (media multiplex), together with other CCITT Recommendations for coding speech at bitrates less than or equal to 64 kbit/s, in some cases with a speech bandwidth wider than 4 kHz. These activities were synergistic with the huge CCITT standardisation project known as ISDN, which aimed to bring 144 kbit/s to subscribers using existing telephone lines.

In the mid 1980s several Consumer Electronics laboratories were studying methods to digitally encode audio-visual signals for the purpose of recording them on magnetic tape. One example was the European Digital Video Recorder project, originally a very secretive project that people expected would provide a higher-quality alternative to the analogue VHS or Betamax videocassette recorder, much as the CD was a higher-quality alternative to the LP record. Still in the area of recording, but for a radically new type of application – interactive video on compact disc – Philips and RCA were studying methods to encode video signals at bitrates of about 1.4 Mbit/s, fitting the output bitrate of their Compact Discs.

Laboratories of broadcasting companies and related industries were also active in the field of audio and video coding for broadcasting purposes. The Commission Mixte pour les Transmissions Télévisuelles et Sonores (CMTT), a special Group of the ITU dealing with issues of transmission of radio and television programs on telecommunication networks, later folded into ITU-T as Study Group 9 and now merged into Study Group 16 – Multimedia coding, systems and applications (2026-2028 Study Period) – had started working on transmission of compressed digital television for “primary contribution” (i.e. transmission between studios). At the end of the 1980s, RAI and Telettra (then an Italian manufacturer of telecommunication equipment) had developed an HDTV codec for satellite broadcasting that was used for very impressive demonstrations during the Soccer World Cup hosted by Italy in 1990. Slightly later, General Instrument (GI) – a manufacturer of Cable TV equipment, acquired by Motorola, which was partly acquired by Google and partly sold to Arris Group, in turn acquired by CommScope – showed its Digicipher II system for terrestrial HDTV broadcasting in the very narrow 6 MHz bandwidth used in American terrestrial television.

This short list of examples shows how, at the end of the 1980s, the telco, CE and broadcasting industries had already embarked on implementations – some of them at research level and some of industrial value – that were based on digital technologies and provided products to companies and end users, with the intention of consolidating or extending their own positioning in the communication and media businesses. The computer industry was missing from the list because, even if it had been the first to make massive use of data processing techniques, in the second half of the 1980s the computing machines within reach of end users – mostly Macintoshes and IBM-compatible Personal Computers (PC) in the office, and Atari, Commodore 64 and Amiga machines at home – still needed at least one order of magnitude more processing power to provide their users with natural moving video of acceptable size and natural sound of acceptable quality. In January 1988, the IBM representatives at the JPEG meeting proudly showed how an IBM AT could decode in real time a DCT-encoded still picture at the bitrate of 64 kbit/s!

This snapshot describes what technology could achieve in the 1980s, but says nothing of the mindsets of the people who had masterminded those developments. Beyond the superficial commonality of technological solutions, the different industries and, within each of these industries, different countries and regions of the world had fundamental differences of traditions, strategies and regulatory concerns. 

The telco industry placed great value on the existence of standard solutions, but was typically weak in end-user equipment. As terminal equipment was outside the telcos’ purview, terminals had, on the one hand, to adhere to standards if they were intended to be connected to the “public network” and, on the other, to be left to the goodwill of the manufacturing industry. Manufacturers had scarce inclination to invest in new terminal equipment because they were accustomed to receiving guaranteed orders from operators at prices that could hardly be described as the result of fierce competition. The interest in digital solutions was linked to the digitisation prospects of the telephone network: basic-access ISDN (144 kbit/s) for the general user and at most primary-access ISDN (1.5 or 2 Mbit/s) for the professional user. As we have said, there continued to be an underground clash between those who were driven by the need to foster the evolution of the services – here and now – and those who assumed that the world would stay unchanged and cherished the dream of bringing optical fibres with virtually unlimited bandwidth to end users some time in the future.

The CE industry did not feel particularly constrained by the existence of standards, as shown by the adoption of 44.1 kHz as the sampling frequency of compact disc audio, selected because analogue video recorders provided an easy means to digitally encode audio in the early phases of development. That industry, however, had the culture and financial muscle to design and develop user equipment for mass-market deployment, equipment that often required sophisticated and costly integrated circuits, when market needs so dictated. The weak point of that industry showed when equipment from several manufacturers that was functionally similar but technically incompatible appeared almost simultaneously in the market. Just the names V2000, Betamax and VHS, the three formats of home video cassette recorders, and the battles that raged around them, should suffice to explain the concept.

Even more complex was the attitude of the broadcasting industry. It was rigidly regulated in Europe and Japan and, less visibly but equally if not more rigidly, regulated in the USA. In Europe the Commission of the European Communities (at that time called CEC) had laid down the policy of evolution of television through Multiplexed Analogue Components (MAC) towards the European version of HDTV transmission called HD-MAC, both via satellite. In the USA and Japan the policy was one of evolution from analogue NTSC to analogue HDTV. In Japan, the introduction of HDTV was expected to happen via satellite, while in the USA it was expected to happen as an evolution of the terrestrial network.


Carrying Bits

Because the computer industry was “born” digital, it was the first to be confronted with the problem of “mapping” digital data onto analogue carriers, i.e. storing bits on intrinsically analogue devices. One of the first solutions – storage of bits on paper tape – was limited to small quantities of data, typically programs, but magnetic technologies were more promising because one could use tapes, drums or disks to store data with vastly improved capacity. 

Magnetic tapes for sound recording had already achieved a considerable degree of maturity, as they had been in existence for some time. The difference was that sound is analogue and intrinsically band-limited, so that a suitable transducer could convert a current or a voltage directly into a magnetic field and vice versa, while binary data from computers have a theoretically infinite bandwidth. The obvious solution was to “modulate”, i.e. pre-filter, the binary data so as to minimise the interference between successive bits entering the storage device, caused by their “infinite” bandwidth. Further, to enable the identification and correct writing/reading of the data, the information had to be “formatted”, i.e. it had to be given a precise structure, in very much the same way as a text is formatted in lines, paragraphs and pages.

Not long after the computer industry had been confronted with the problem of storing digital data on analogue media, the telecommunication industry was confronted with the similar need of “sending” digital data through the analogue transmission medium called the telephone cable. One example is provided by the elements of the digital transmission hierarchy, where the equivalent of the magnetic disk or tape formatting is the “frame”. The primary A-law based multiplexer has a frame of 256 bits (32 time slots of 8 bits each), where TS 0 carries a fixed pattern so that it can act as a “start code” telling a receiving device where the frame begins (of course there is no guarantee that this code word cannot be emulated by unconstrained telephone data, because it is possible that in a frame one speech sample has exactly that value). Higher-order multiplexers are organised in a similar manner, in the case of the European hierarchy by multiplexing 4 lower-order streams.
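
The way a receiver hunts for the frame start, and why the possibility of emulation forces it to confirm the alignment over several frames, can be sketched as follows (Python, with an illustrative 7-bit pattern rather than the actual TS 0 alignment word).

    # Sketch of frame alignment: hunt for a fixed pattern repeating at a 256-bit spacing
    # and declare alignment only after it is confirmed in several consecutive frames,
    # because unconstrained speech bytes can emulate the pattern once in a while.

    import random

    FRAME_BITS = 256
    PATTERN = [0, 0, 1, 1, 0, 1, 1]                # illustrative alignment word

    def find_alignment(bits, confirmations=3):
        for offset in range(FRAME_BITS):
            hits, position = 0, offset
            while position + len(PATTERN) <= len(bits):
                if bits[position:position + len(PATTERN)] != PATTERN:
                    break                          # candidate rejected, try the next offset
                hits += 1
                if hits == confirmations:
                    return offset                  # pattern confirmed in consecutive frames
                position += FRAME_BITS
        return None                                # alignment not (yet) found

    random.seed(1)
    stream = [random.randint(0, 1) for _ in range(10 * FRAME_BITS)]
    for frame in range(10):                        # write the pattern into each frame
        start = frame * FRAME_BITS + 40            # the true alignment offset is 40 in this toy stream
        stream[start:start + len(PATTERN)] = PATTERN
    print(find_alignment(stream))                  # -> 40 (unless a random emulation wins earlier)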

The COST 211 project mentioned above did not just develop the video coding part but provided the specification of the complete transmission system for videoconference applications. In the COST 211 system, TS 1 carries speech, TS 17 optionally carries a digital facsimile signal and TS 18 optionally carries other data. Several additional types of information have to be conveyed from one terminal to the other, such as information on whether TS 17 and TS 18 contain video data or facsimile and other data. 

Transmitting digital data over telephone lines is conceptually the same as storing data on a local storage device, but in general the “modulation” schemes have to be much more sophisticated, because the telephone line is such a bandwidth-limited transmission system. A telephone line has a nominal bandwidth of 4 kHz (actually the transmitted speech signal has significant energy only in the 300 to 3,400 Hz band). It also has unpredictable characteristics, caused by extremely variable operating conditions and by the fact that telephone cables have been deployed over a long span of time. Magnetic tapes and disks, on the other hand, have better-defined characteristics thanks to well-monitored manufacturing processes and more predictable operating conditions.

The initial modulation schemes supported a low transmission rate of 300 baud (a unit named after Émile Baudot, the inventor of the telegraphy code that supplanted Morse code; at these low speeds each symbol carried one bit, so 300 baud corresponded to 300 bit/s). Later, higher bitrates became progressively possible thanks to “adaptive” schemes that automatically adapted their performance to the characteristics of the line. More and more sophisticated schemes were developed and the bitrate climbed to several kbit/s, always as a multiple of 300. A set of widely used ITU-T recommendations eventually made it possible for a new generation of nomadic users to connect from anywhere to anywhere else in the world, over distances of possibly thousands of kilometres, at rates as high as 56 kbit/s, depending on the end-to-end link “quality”.

An ambitious goal that the telco industry set itself in the late 1960s was the development of ISDN. The plan was to provide telephone subscribers with two 64 kbit/s channels (so-called B-channels) and one 16 kbit/s signalling channel (the so-called D-channel), for a total of 144 kbit/s (so-called 2B+D). With the usual schizophrenia of the telco business, ISDN was not fully defined in all its parts. In particular, the modulation scheme for the local access was left to each telco. Assuming that users are static (a reasonable assumption at that time), this was not unreasonable, but it is what prevented later laptops from using ISDN, one of the reasons why ISDN eventually did not fly, not even in countries where significant levels of deployment had been achieved.

At the end of the 1980s, while the ISDN standardisation project was drawing to a close, some telco R&D laboratories showed the first results of what should have been great news for companies whose assets were buried underground in the form of millions of kilometres of telephone cable. The technology was called Asymmetric Digital Subscriber Line (ADSL), which would allow downstream (central office to subscriber) transmission of “high” bitrate data, e.g. 1.5 or 2 Mbit/s, with a lower-rate upstream transmission, e.g. 64 or 128 kbit/s from the subscriber terminal to the central office. 

One instance of this technique used a large number of carriers placed in appropriate parts of the spectrum after an initial phase in which the transmitter checked the state of the line by interacting with the receiver. This type of research work was generally ostracised within the telcos because it provided an alternative and competing solution – in terms of cost, certainly not of performance, and of deployment time – to the old telco dream of rewiring all their subscribers with optical fibres for some yet-unknown-but-soon-to-come pervasive “broadband” applications. Later, ADSL provided a still “asymmetric” access (typically 5-10 times more bit/s downstream than upstream) to hundreds of millions of subscribers around the world at increasing bitrates. ADSL played a major role in allowing fixed telephony service providers to survive.

If one sees the constant progress that magnetic disks have been making in terms of storage capacity and the snail-like progress of ADSL, one could be led to think that the telco industry is simply not trying hard enough. While there is some truth in this statement ;-), one should not forget that while the manufacturing of hard disks happens in clean rooms, the local telephone access has to deal with a sometimes decades-old infrastructure deployed with wires of varying quality, different installation skills and unpredictable operating conditions. It is definitely not a fair comparison.

If a comparison has to be made, it is with the optical fibres used for long-distance transmission. In this case it is easy to see how the rate of increase in bitrates is even higher than the rate of increase in hard disk capacity. Again this is possible because optical fibres are a new technology and fibres are probably manufactured with as much care, and using equipment as sophisticated and expensive, as those used in high-capacity magnetic disk manufacturing. The problem is that, of the long-distance fibres that were deployed in the years of collective madness at the end of the 1990s, only a few percent are actually lit, a fact that is clearly not disconnected from the slow introduction of broadband in the local loop. This underutilisation also shows the difference between the concrete advantage felt by a consumer buying a hard disk today with twice the capacity at the same or lower price compared to the year before, versus financial decisions that telco executives made based on expectations of long-distance traffic that depended on the coming of some gainful “broadband application” Messiah.

The last physical delivery system considered in this list is the coaxial cable used for Community Antenna Television (CATV), a delivery infrastructure originally intended for distribution of analogue television signals. For this, the widely chosen modulation system is Quadrature Amplitude Modulation (QAM).  The CATV industry has made great efforts to “digitise” the cable in order to be able to provide digital interactive services. The Data Over Cable Service Interface Specification (DOCSIS) still provides high bitrates to CATV subscribers.

Another transmission medium rivalling the complexity of the local access telephone line is the VHF/UHF band used for television broadcasting on the terrestrial network. Already in the 1980s several laboratories, especially in Europe, were carrying out studies to develop modulation methods that would enable transmission of digitised audio and television signals in the VHF/UHF bands. They came to the conclusion that such frequencies could typically carry between 1 and 4 bit/s per Hz, depending on operating conditions: the lower the bitrate, the higher the terminal mobility that could be supported, with the highest bitrates reserved for fixed terminals.

The modulation scheme selected in Europe and other parts of the world to digitise the VHF/UHF frequency bands is called Coded Orthogonal Frequency Division Multiplexing (COFDM). This uses a large number of carriers – up to several thousand – each carrying a small bitrate. It is a technology similar to ADSL, with the difference that in broadcasting no return channel is available to adapt the modulation scheme to the channel. In the USA a different system called 8 Vestigial Side Band (8VSB) – a single-carrier modulation system – was selected at the end of the 1990s for digital terrestrial broadcasting.

For satellite broadcasting the typical modulation scheme is Quadrature Phase Shift Keying (QPSK), a modulation system where the carrier’s phase is shifted in 90° increments. 

Digital cellular phone systems have been widely deployed in several countries. The modulation system used for Global System for Mobile (GSM) is Time Division Multiple Access (TDMA), a multiple access technique where the access to the channel is based on time slots – like those used in digital telephony multiplexers – corresponding to digital channels, usually of a fixed bitrate. 

The 3rd generation (3G) mobile communication system was based on Code Division Multiple Access (CDMA), of which several incompatible flavours existed (CDMA had already been used in some 2nd generation digital mobile telecommunication systems). This is a specialisation of a more general form of wireless communication called Spread Spectrum, used in multiple-access communications where independent users share a common channel without an external synchronisation. In this communication form the bitstream is spread throughout the available bandwidth using a periodic binary sequence, called a Pseudo-random Noise (PN) sequence. Because of this information scrambling, the bitstream appears as wideband noise. The receiver uses the same PN sequence as the transmitter to recover the transmitted signal, and any narrowband noise is spread into a wideband signal.
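
A minimal numerical sketch of the spreading idea (Python/NumPy, with a short random PN sequence, ignoring synchronisation, multiple users and real channel effects) looks like this:

    import numpy as np

    rng = np.random.default_rng(42)
    CHIPS_PER_BIT = 16
    pn = rng.choice([-1, 1], size=CHIPS_PER_BIT)          # pseudo-random spreading sequence

    def spread(bits):
        symbols = np.where(np.array(bits) == 1, 1, -1)
        return (symbols[:, None] * pn[None, :]).ravel()   # each bit becomes CHIPS_PER_BIT chips

    def despread(chips):
        blocks = chips.reshape(-1, CHIPS_PER_BIT)
        correlation = blocks @ pn                         # correlate with the same PN sequence
        return (correlation > 0).astype(int)

    bits = [1, 0, 1, 1, 0]
    signal = spread(bits)
    noisy = signal + rng.normal(0, 1.0, size=signal.shape)  # wideband noise added on the channel
    print(despread(noisy))                                   # -> [1 0 1 1 0] with high probability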

3G was not the end of the story, as it was followed by 4G and, more recently, 5G. Each generation has provided more bandwidth, but at the price of a shorter distance between base stations (the hubs, connected to the larger network, that collect the signals of an area). There is active research on the so-called 6G, which promises even more bandwidth and features.


Telecom Bits And Computer Bits

Since the early times of computing, it became apparent that CPUs should be designed to handle chunks of bits called “bytes” instead of, or in addition to, individual bits, obviously without altering the status of bits as the atomic components of information. After some odd initial choices (like the 6 bits of the UNIVAC byte), the number of bits in a byte soon converged to 8 (hence bytes are sometimes called “octets”). With the progress of technology, CPUs became capable of dealing with more bytes at the same time. In the late 1960s and 1970s minicomputers were based on a two-byte architecture that enabled the CPU to address 64 Kbytes of memory. Today the CPU of some advanced game machines can handle many bytes at a time, even 64 bytes.

When the telcos decided to digitise speech, they, too, defined their own “byte”, the speech sample. After some initial dithering between 7 and 8 bits – all in the closed environment of CCITT meeting rooms in Geneva, with Americans favouring 7 and Europeans 8 bits – the eventual choice was 8 bits. Unlike the computer world, however, in which most processing involves bytes, telecom bytes are generated at the time of analogue-to-digital (A/D) conversion, but then they are immediately serialised and kept in that form until they are converted to bytes just before digital-to-analogue (D/A) conversion. Because of the way D/A conversion works, the “natural” order of bits in a telecom byte is Most Significant Bit (MSB) to Least Significant Bit (LSB).

The order of bits in a byte really depends on the architecture of the particular computer that processes the bytes. The same ambiguity is found in multi-byte data where the identification of how bytes are stored in the computer’s memory is described by little or big-endian. In a big-endian system, the most significant value in the sequence is stored at the lowest storage address (i.e., first). In a little-endian system, the least significant value in the sequence is stored first.
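
A quick way to see the difference is Python's struct module, which lets one choose the byte order of a multi-byte value explicitly.

    import struct

    value = 0x0A0B0C0D
    print(struct.pack('>I', value).hex())   # big-endian:    0a0b0c0d (most significant byte first)
    print(struct.pack('<I', value).hex())   # little-endian: 0d0c0b0a (least significant byte first)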

Transmission also responds to very different needs than storage or processing. In the 1960s, telcos started using serialised and comparatively high transmission rates of 1,544 or 2,048 kbit/s, but network equipment performed rather simple operations on such streams, one of the most important being the identification of the “frame start”. Transmission channels are far from error free and, as we have already said, the codeword identifying TS 0 can be emulated. This means that a receiver must be programmed to deal with the moment it is first switched on and with the moments when frame alignment is lost. The data that have flowed in the meantime are, well, lost, but there is no reason to worry: after all, it is just a few milliseconds of speech.

For quite some time the bitrate used for transmission of computer data over the network was limited to a few hundred kbit/s, but the network had to perform rather sophisticated operations on the data. Data transmission had to be error free, which means that codeword emulation had to be avoided or compensated and retransmission requested for all data that, for whatever reason, did not satisfy strict error checking criteria. 

Because the network did not have to perform complex operations on the speech samples (which does not mean that the logic behind the routing of those samples was simple), the transmission mode is “synchronous”. This means that the transmission channel can never be “idle” and requires that speech samples be organised in fixed-length “frames”, where a frame is immediately followed by another frame. Most networks derive the clock from the information flowing through them, but what happens if there is no speech and all bits are set to zero? To avoid the case where it becomes impossible to derive the clock, every other bit of the speech samples is inverted. Computer networks, on the other hand, transmit data in frames of variable length called “packets”.

This is an area where I had another techno-ideological clash with my telco colleagues in Europe. While the work on H.261 was progressing, COST 211 bis was discussing ways to multiplex the same data that the original COST 211 project had found necessary: audio, facsimile, text messages and, because things were changing, even some computer data arriving through one of those funny multiples of 300 bit/s rates used in telephony modems. With all the respect I had for the work done in COST 211 (to which, by the way, I and my people had been major contributors), where data multiplexing was done in the best telco tradition of “frames” and “multiframes”, I thought that there should be more modern and efficient – i.e. packet-based – ways of multiplexing data. 

In COST 211 I had already proposed the use of a packet transmission system for exchanging messages between terminals and the Multi-Conference Unit, a device that performed the management of a “multiconference”, i.e. a videoconference with more than 2 users. The message proposal had been accepted by COST 211, but this was not surprising because in telcos the “signalling” function was dealt with by people with friendly ears for the IT language. My new proposal to define a packet-based multiplexer for media, however, was made in a completely different environment and fell on deaf (or closed) ears. This is why H.221, the multimedia multiplexer used for H.261, was the latter-day survivor of another age: it organised non-video data in chunks of 8 kbit/s subchannels, and each of these subchannels had its own framing structure that signalled which bit in the frame was used for which purpose. It is unfortunate that there is no horror museum of telecom solutions, because this one would probably sit at its centre.

There were two reasons for this. The first, and more obvious, is that there are people who, having done certain things in a certain way throughout their lifetime, simply do not conceive that the same things can possibly be done in a different way, particularly so if the new ideas come from younger folks driven by some alien, ill-understood new discipline. In this case, my European colleagues were so accustomed to the sequential processing of bits with a finite state machine that they could not conceive that there could be a microprocessor that would process the data stream in bytes and not in bits, instead of a special device designed on purpose to follow certain logic steps. The second reason is more convoluted. In some Post, Telephone and Telegraph (PTT) administrations, the state had retained the telegraph and postal services but had licensed the telephone service to a private firm, even though the latter was still under some form of control by the state. In such environments, there was ground for an argument that “packet transmission” was akin to telegraphy and that telcos should therefore not automatically be given a licence to manage packet data transmission services. Those telcos were then afraid of losing what at that time was – rightly – considered the next telco frontier.

This is what it means to be a regulated private company providing public services. Not many years ago, at a time when the telco business was said to be unregulated – while the state happily put its nose in the telephone service price list – one could see different companies digging up the same portion of the street more than once to lay the same cables to provide the same services, when doing it once would have sufficed for all. Or one could see different companies building base stations at the same site two or three times over, when one would have been enough for all (and with less power consumption and electromagnetic pollution, too). All this madness was driven by what I call the “electric pole-driven competition” philosophy, under the watchful eye of the European Commission, which made sure that no one even thought of “sharing the infrastructure”.

Yesterday, cables were laid once and antennae hoisted only once, but then the business had to be based on back-door dealings where bureaucrats issued rulings based on some arcane principles, after proxy battles intelligible – if ever – only by the cognoscenti. 

Frankly, I do not know which one of the two I like better. If I could express a desire, I would like a regulated world without brainless bureaucrats (I agree, it is not easy…), or a competitive world where the achievements of competition are not measured by the number of times city streets are dug up to lay down the same cables belonging to different operators to offer the same services, but by the number of smart new services that are provided by different operators, obviously in competition, on the same plain old physical infrastructure. Actually there is room for sharing some non-physical infrastructure too, but that is another story.

Until recently mine was a heretic view, but hard economic times have brought some resipiscence to some Public Authorities. There were at last second thoughts about imposing the building of separate mobile infrastructures by each operator and infrastructure sharing was no longer taboo. There is no better means to bring back sanity in people’s minds than the realisation that the bottom of the purse has been reached.

A further important difference between transmission in the telecom and computer worlds was that, when computers talk to computers via a network, they do so using a paradigm very different from that of the telephone network of that time. The latter is called connection-oriented because it assumed that, when subscriber A wants to talk to subscriber B, a unique path is set up between the two telephone addresses by means of signalling between nodes (switches) and is maintained (and charged!) for the entire duration of the conversation. The computer network model, instead, assumed that a computer is permanently “connected” to the network, i.e. that it is “always on”, so that when computer A wants to talk to computer B, it chops the data into packets of appropriate length and sends the first packet towards computer B, attaching to the packet the destination address and the source address. The computer network, being “always on”, knows how to deliver that packet of data through the different nodes of the network to computer B. When computer A sends the second packet, it is by no means guaranteed that the network will use the same route as for the first packet. It may even happen that the second packet arrives before the first, because the latter may have been kept queuing somewhere in other network nodes. This lack of guaranteed packet sequence is the reason why packet networks usually provide means to restore the original order of the packets and to control their flow, so as to free applications from these concerns. This communication model is called connection-less.
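As a minimal C sketch of the connection-less model just described (the addresses, the sequence-number field and the payloads are invented for illustration):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical datagram: every packet carries both addresses and a
     * sequence number, because the network guarantees neither the route
     * nor the order of arrival. */
    struct datagram {
        uint32_t src_addr;
        uint32_t dst_addr;
        uint32_t seq;              /* position of the packet in the stream */
        char     payload[16];
    };

    int main(void)
    {
        /* Packets as they might arrive: the second one overtook the first. */
        struct datagram arrived[2] = {
            { 0x0A000001, 0x0A000002, 1, "world" },
            { 0x0A000001, 0x0A000002, 0, "hello " },
        };

        /* The receiver restores the original order from the sequence numbers. */
        char message[64] = "";
        for (uint32_t want = 0; want < 2; want++)
            for (int i = 0; i < 2; i++)
                if (arrived[i].seq == want)
                    strcat(message, arrived[i].payload);

        printf("%s\n", message);    /* prints "hello world" */
        return 0;
    }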

Several protocols were developed to enable transmitters and receivers to exchange computer data in the proper order. Among these is the ITU-T X.25 protocol, developed and widely deployed since the 1970s. X.25 packets use the High-level Data Link Control (HDLC) frame format. The equivalent of the 2 Mbit/s sync word is a FLAG character of 01111110 in binary (7E in hexadecimal). To avoid emulation of the FLAG by the data, the transmitter inserts a 0 after 5 consecutive 1s, and the receiver deletes a 0 if it follows 5 consecutive 1s (this is called “bit-stuffing”). Having been developed by the telecommunication industry, X.25 unsurprisingly attempted to merge the connection-oriented and connection-less models, in the sense that, once a path is established, packets follow one another in good order through the same route.
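The bit-stuffing rule is simple enough to be shown in a few lines of C (bits are represented here as a string of '0'/'1' characters purely for readability):

    #include <stdio.h>

    /* HDLC-style bit-stuffing as used by X.25: the transmitter inserts a 0
     * after any run of five consecutive 1s so that the payload can never
     * imitate the 01111110 FLAG; the receiver removes that 0 again. */
    static void bit_stuff(const char *in, char *out)
    {
        int ones = 0;
        while (*in) {
            *out++ = *in;
            if (*in == '1') {
                if (++ones == 5) { *out++ = '0'; ones = 0; }  /* insert a zero */
            } else {
                ones = 0;
            }
            in++;
        }
        *out = '\0';
    }

    int main(void)
    {
        char stuffed[64];
        bit_stuff("0111111101111110", stuffed);
        printf("%s\n", stuffed);   /* 011111011011111010: never six 1s in a row */
        return 0;
    }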

The way data move through the nodes of a network is also paradigmatic of the different approaches of the telecommunication and computer worlds. Each of the 30 speech channels contained in a primary multiplex was instantaneously switched to its destination, whereas an X.25 packet is first stored in its entirety in the switch and only then routed to its destination. Because of the considerable delay that a packet can undergo in a complex X.25 network, a variation of the protocol – dubbed Fast Packet Switching – was introduced in the late 1980s. The computer in the node interprets the destination address without waiting to store the full packet and, as soon as it has understood it, routes the packet immediately to its destination.

It is nice to think of the intersection of two movements: “data from computers become analogue and are carried by the telephone network” and “speech signals become digital data and are processed by computers”, but this would be an ideological reading. ISDN was a project created by the telcos to extend the reach of digital technologies from the core to the access network, the rather primitive – I would say naïve – service idea driving it being the provision of two telephone channels per subscriber. The hope was to optimise the design and management of the network, not to enable a better way to carry computer data at a higher bitrate. Speech digitisation did make the speech signal processable by computers, but in practice the devices that handled digitised speech could hardly be described as computers, as they were devices that managed bits in a very efficient but non-programmable way. 

In the early 1980s there were more and more requests on the part of users to connect geographically dispersed computers through any type of network. This demand prompted the launch of an ambitious project called Open System Interconnection (OSI). The goal was to develop a set of standards that would enable a computer of any make to communicate with another computer of any make across any type of network. The project was started in Technical Committee 97 (TC 97) “Data Processing” of the International Organisation for Standardisation (ISO) and, for obvious reasons, was carried out jointly with ITU-T, probably the first example of a large-scale project executed jointly by two distinct Standard Developing Organisations (SDO). 

For modelling purposes, the project broke down the communication functions of a communication device talking to another communication device into a hierarchical set of layers. This led to the definition of a Reference Model consisting of seven “layers”, a major conceptual achievement of the project. Each layer performed the functions required to communicate with the corresponding layer of the other system (peer-to-peer communication), as if the other layers were not involved. Each layer relied on the layer hierarchically below to have the relevant lower-layer functions performed and it provided “services” to the next higher layer. The architecture was so defined that changes in one layer should not require changes in the other layers. 

The seven OSI layers with the corresponding functions are: 

  • Physical: Transmission of unstructured bit streams over the physical link
  • Data link: Reliable transfer of data across the physical link
  • Network: Data transfer independent from the data transmission and switching technologies used to connect systems
  • Transport: Reliable and transparent transfer of data between end points
  • Session: Control structure for communication between applications
  • Presentation: Data transformations appropriate to provide a standardised application interface
  • Application: Services to the users of the OSI environment
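To give a flavour of what layering means in practice, here is a tiny C sketch of the encapsulation idea: each layer wraps what it receives from the layer above with its own header, and its peer on the receiving side strips that header off again (only three layers are shown and the header strings are invented):

    #include <stdio.h>

    /* Each layer adds its own header in front of the data handed down by the
     * layer above; the peer layer on the receiving side removes it again. */
    static void encapsulate(const char *payload, char *frame, size_t size)
    {
        char transport[128], network[160];
        snprintf(transport, sizeof transport, "[TP]%s", payload);   /* transport */
        snprintf(network,   sizeof network,   "[NW]%s", transport); /* network   */
        snprintf(frame,     size,             "[DL]%s", network);   /* data link */
    }

    int main(void)
    {
        char frame[256];
        encapsulate("application data", frame, sizeof frame);
        printf("%s\n", frame);    /* prints [DL][NW][TP]application data */
        return 0;
    }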

In the mid 1980s, the telco industry felt it was ready for the big plunge into the broadband network reaching individual subscribers that everybody had dreamed of for decades. The CCITT started a new project, quite independently of the OSI project, as its main sponsors were the “transmission and switching” parts of the telcos. The first idea was simply to scale up the bitrate of the old telecom network. This, however, was soon abandoned, for two main reasons: the first was the expected traffic increase of packet-based computer data (the main reason for the American telcos to buy into the project) and the second was the idea that such a network could only be justified by the provision of digital video services such as videoconference and television (the main motivation for the European telcos).

Thus both envisaged applications were inherently variable-bitrate: computer data because of their bursty nature, and video because the amount of information generated by a video source depends heavily on its “activity”. Asynchronous Transfer Mode (ATM) was the name given to the technology, which was a clear derivative of Fast Packet Switching. The basic bitrate was 155 Mbit/s, an attempt at overcoming the differences in transmission bitrates between the two hierarchies spawned by the infamous 25-year-old split. The basic cell length was 48 bytes, an attempt at reconciling the two main reasons for having the project – 64 bytes for computer data and 32 bytes for real-time transmission – by taking the arithmetic mean of the two.

A horse designed by a committee?

 


A Personal Faultline

During my previous incarnation as a researcher in the video coding field, I made more than one attempt at unification. But do not expect lofty thoughts of global convergence of businesses. At that time my intention was just to achieve common coding architectures that could suit the needs of different industries, without considering the ultimate fate of the individual converged industries. What mattered to me was to enable a “sharing” of development costs for the integrated circuits that were required to transform digital technologies from a promise of a bright but distant future to an even brighter reality – but tomorrow. It is fair to say, though, that in these attempts I was biased by my past experience dealing with devices capable of performing very complex operations on very high-bitrate signals, the reluctance of telcos to make investments in the terminal device area and the readiness of the CE industry to develop products without consideration of standards – provided a market existed. 

I gradually came to the conclusion that preaching the idea at conferences was not enough and that the only way to achieve my goals was by actually teaming up with other industries. The opportunity to put my ideas into practice was offered by the European R&D program bearing the name of Research & development on Advanced Communication for Europe (RACE) that the European Commission had launched in 1984, after the successful take-off of the European Strategic Program of Research and development in Information Technology (ESPRIT) one year before.

The Integrated Video Codec (IVICO) project, led by CSELT, was joined by telecommunication operators, broadcasting companies, and manufacturers of integrated circuits and terminal equipment. The project proposal declared the goal of defining, as a first step, a minimum number of common integrated circuits (at that time it was too early to think of a single chip doing everything) such as motion estimation and compensation, DCT, memory control, etc., with the intention of using them for a wide spectrum of applications. The project had a planned duration of one year, after which it was expected to be funded for a full five-year period.

For several reasons, however, the project was discontinued after the first-year “pilot” phase. One reason was the hostility of certain European quarters that were alarmed by the prospect that integrated circuits for digital television could become available within a few years – one of the not so rare cases where it does not pay to deliver. This possibility clashed with the official policy of the European Commission, prompted by some European governments and some of the major European manufacturers of CE equipment, which promoted standard and high definition television in analogue form under an improved analogue TV system called MUltiplexed Analogue Components (MAC). The application of digital technologies to television would only happen – so ran the policy – “in the first decade of the third millennium”.

The demonstrated impossibility of executing a project to develop a microelectronic technology for AV coding, for use by the European industry at large, forced me to rethink the strategy. If the European industrial context was not open to sharing a vital technology, then operating at the worldwide level would shield me from non-technical influences and pressures from my own backyard. For somebody who wanted to see things happening for real, this was a significant scaling down of the original ambitions, because an international development of a microelectronic technology was not conceivable. On the other hand, this diminution was compensated for by the prospect of achieving a truly global solution, albeit one of specification only and not of technology.

At that time (mid 1980s) it was not obvious which body should take care of the definition of the common core because “media-related” standardisation was scattered across the three main international bodies and their subdivisions:

  • CCITT (now ITU-T) handled Speech in SG XV WP 1 and Video in SG XV WP 2;
  • CCIR (now ITU-R) handled Audio in SG 10 and video in SG 11;
  • IEC handled Audio Recording in SC 60A and Video Recording in SC 60B, Audio-visual equipment in TC 84 and Receivers in SC 12A and SC 12G;
  • ISO handled Photography in TC 42, Cinematography in TC 36 and Character sets in TC97/SC2.

Chance (or Providence) offered the opportunity to test my idea. During the Globecom conference in Houston, TX in December 1986, where I presented a paper on IVICO, I met Hiroshi Yasuda, an alumnus of the University of Tokyo, where he had been a Ph.D. student in the same years (1968-70) as myself. At that time he was well known for his excellent reading of karuta, a word derived from carta, meaning playing card, which the Portuguese had brought when they first reached Japan in 1543. Japanese people still read Hyakunin isshuu poems written on karuta at year-end parties. After his Ph.D., Hiroshi had become a manager at NTT Communication Laboratories, where he was in charge of video terminals. He invited me to come and see the Joint ISO-CCITT Photographic Coding Experts Group (JPEG) activity carried out by a group inside a Working Group (WG) of which he was the Convenor.

Hiroshi’s WG was formally ISO TC 97/SC 2/WG 8 “Coding of Audio and Picture Information”. SC 2 was a Subcommittee (SC) of TC 97 “Data processing”, the same TC where another Subcommittee (SC 16) was developing the Open System Interconnection (OSI) standard. TC 97 would become, one year later, the joint ISO/IEC Technical Committee JTC 1 “Information Technology”, by incorporating the microprocessor standardisation and other IT activities of the IEC. SC 2’s charter was the development of standards for “character sets”, i.e. the code assignment to characters for use by computers. WG 8 was a new working group established to satisfy the standardisation needs created by the plans of several PTT administrations and companies to introduce pictorial information in various teletext and videotex systems already operational at that time (e.g. the famous Minitel deployed in France). These systems already utilised ISO standards for characters, and audio and pictures were considered as their natural evolution. JPEG was a subgroup of WG 8 tasked with the development of a standard for coded representation of photographic images jointly with CCITT Study Group (SG) VIII “Telematic Services”.

My first attendance at JPEG was at the March 1987 meeting in Darmstadt and I was favourably impressed by the heterogeneous nature of the group. Unlike the various groups of the Conférence Européenne des Postes et Télécommunications and of the European Telecommunication Standards Institute (ETSI) in which I had operated since the late 1970s, JPEG was populated by representatives of a wide range of companies such as telecommunication operators (British Telecom, Deutsche Telekom, KDD, NTT), broadcasting companies (CCETT, IBA), computer manufacturers (IBM, Digital Equipment), terminal equipment manufacturers (NEC, Mitsubishi), integrated circuit manufacturers (Zoran), etc. By the time of the Copenhagen meeting in January 1988, I had convinced Hiroshi to establish a parallel group to JPEG, called Moving Picture Coding Experts Group (MPEG), with the mandate to develop standards for coded representation of moving pictures. The first project concerned video coding at a bitrate of about 1.5 Mbit/s for storage and retrieval applications “on digital storage media”.

At the same meeting Greg Wallace, then with Digital Equipment Corporation, was appointed as JPEG chairman and another group, called Joint ISO-CCITT Binary Image Coding Experts Group (JBIG), for coding of bilevel pictures such as facsimile, was also established. Yasuhiro Yamazaki of KDD, another alumnus of the University of Tokyo in the same years 1968-70, was appointed as its chairman. 

The reader may think that the fact that three alumni of Tokyo University (Todai, as it is called in Japan) were occupying these positions in an international organisation is proof that the Todai Mafia was at work. I can assure the reader that this was not the case. It was just one example of how a sometimes-benign and sometimes-malign fate drives the lives of humans.


The 1st MPEG Project

The target of the first MPEG work item was of interest to many: to the Consumer Electronics industry, because it could create a new product riding on the success of CD Audio by extending it to video; to the IT industry, because interactivity with local pictures, enabled by the growing computing power of PCs, was a great addition to its ever-broadening application scope; and to the telco industry, because of the possibility to promote the development of the much-needed integrated circuits for H.261 real-time audio-visual communication that, as explained before, they were unable to develop by themselves.

This is possibly too sweetened a representation of industry feelings at that time, because each industry had radically different ways of operating. In the Consumer Electronics world, when a new product was devised, each company, possibly in combination with some trusted ally, developed the necessary technology (and filed the enabling patents) and put its – incompatible – version of the product on the market. As other competitors put their versions of the product on the market at about the same time, the different versions competed until the market crowned one as the winner. At that point the company or the consortium with the winning product would submit some key enabling technology of the product to a standards body and would start licensing the technology to all companies, competitors included. This had happened for the Compact Cassette, when the winner was Philips vs Bosch; for the CD, when the winners were Philips and Sony vs RCA; and for the VHS Video Cassette Recorder (VCR), when the winner was JVC vs Sony.

The project proposed by MPEG implied a way of operating that was clearly going to upset the established modus operandi of the CE world. Participants knew that, by accepting the rules of international standardisation, they would be deprived – in case they were the winners – of the rightful, time-honoured “war booty”, i.e. the exclusive control of the patents needed to build the product, which also gave its holder a large degree of control over the product’s evolution. The advantage for them was that costly format wars could be avoided.

Another industry had mixed feelings: broadcasting. Even though digital television was a strategically important goal for them, in the second half of the 1980s the bitrate of 1.5 Mbit/s was considered way too low to provide pictures that broadcasters would even remotely consider as acceptable. On the other hand, they clearly understood that the technology used by MPEG could be used for entry-level network-based services and could later be extended to higher bitrates, that were expected to provide a quality of interest to them. A glimpse of their attitude can be seen in the letter that Mr. Richard Kirby, the Director of CCIR at that time, sent to the relevant CCIR SG Chairmen upon receiving news of the establishment of MPEG. The letter requested the Chairmen to study the impact that this unknown group could have on future CCIR activities in the area. 

At my instigation, between January 1988 and the first MPEG meeting in May, a group of European companies had gathered with the intention of proposing a project to the ESPRIT program. A consortium was eventually established and a proposal put together. Called COding of Moving Images for Storage (COMIS), it had dual purposes: to contribute to the successful development of the new standard by pooling and coordinating European partners’ resources, and to give European industry a time lead in exploiting that standard. At the instigation of Hiroshi Yasuda, a project with similar goals was being built in Japan with the name of Digital Audio and Picture Architecture. Some time later, a European project funded by the newly established Eurescom Institute (an organisation established by European telcos) and called Interactive Multimedia Services at 1 Mbit/s (IMS-1) was also launched. 

Therefore, by the time the first meeting of the MPEG group took place in Ottawa, ON in May 1988, the momentum was already building and indeed 29 experts attended that meeting, although some of them were just curious visitors from the JPEG meeting next door. In Ottawa the mandate of the group was established. Drafting this was an exercise in diplomacy. There were already other groups dealing with video coding in ITU-T, ITU-R and CMTT, so the mandate was explicitly confined to Storage and Retrieval on Digital Storage Media (DSM). With this came the definition of the 3 initial planned phases of work:

Phase 1: Coding of moving pictures for DSMs having a throughput of 1-1.5 Mbit/s
Phase 2: Coding of moving pictures for DSMs having a throughput of 1.5-10 Mbit/s
Phase 3: Coding of moving pictures for DSMs having a throughput of 10-60 Mbit/s (to be defined)

People in the business had no doubt about our plans. We intended to start working on low-definition pictures, for which the technology was ready to be implemented and a market was expected to exist, because a great carrier – the CD – existed and because of the plans of the CE industry and, partly, the telco industry. The next step would be to move to standard-definition pictures, for which a market did exist because industry was ready to accept digital television, plans for which had been ongoing for years. Eventually we would move to HDTV. These plans were in sharp contrast with those prevailing, especially in European broadcasting circles, where the idea was to start from HDTV and define a top-down hierarchy of compatible coding schemes – technically a good plan, but one that would take years to implement, if ever.

One meeting in Turin and one in London in September followed the Ottawa meeting. So, with the video coding work in MPEG on good foundations, I could pursue another favourite theme of mine. A body dealing with moving pictures with a wide participation of different industries was good, but fell short of achieving what I considered a goal of practical value, because audio-only applications are plentiful, but appealing mass-market video-only applications are harder to find. The importance of this theme was magnified by my experience of the ISDN videophone project of the ITU. In spite of this project being for an AV application par excellence, the video coding standard (H.261), an outstanding piece of work, and the multiplexing standard (H.221), a technically less than excellent piece of work – but never mind – had been developed, while the audio coding part had been left unsettled. This happened because CCITT SG XV had tasked the Video Coding Experts with developing the videophone project, but the Audio Coding experts operated in SG XVIII, and the videotelephone team did not dare to make any decision in a field over which they had no authority.

This organisational structure of the ITU-T, and a similar one in ITU-R and IEC, was a reflection of the organisation of the R&D establishments, and hence of the business, of that time: research groups in audio and in video were located in different places of the organisation because of their different backgrounds, target products and funding channels. This was also a reflection of services that had started with audio and then moved to audio-video, but where video had the lion’s share. My personal experience of television – but I may be biased in my judgment – is that in any service the video signal is always there, but the audio signal is there only if everything goes smoothly. This is not because the audio experts have done a lousy job but because the integration of audio and video has never been given the right priority – in research, standardisation, product development and operation.

For a manufacturer of videophone equipment, the easiest thing to do was to use one B channel for compressed video and one B channel for PCM audio, never mind the not-so-subtle irony that one channel carried a bitstream that was the result of a compression by more than 3 orders of magnitude – from 216 Mbit/s down to 64 kbit/s – while the other carried a bitstream in the form prescribed by a 30-year-old technology, without any compression at all!

So, besides video, the audio component was also needed, and action was required lest MPEG end up like videoconferencing: with an excellent video compression standard but no (music) audio standard, or with an audio quality not comparable with the video quality, or with unjustifiably different compression ratios for the two. The other concern was that integrating the audio component into a system that had not been designed for it could lead to technical oversights that could only be solved later with abominable hacks. Hence the idea of a “Systems” activity, conceptually similar to the function performed by H.221 for the ISDN videophone, but with better performance because it was technically more forward looking. The goal of the “Systems” activity was to develop the specification of the complete infrastructure, including multiplexing and synchronisation of audio and video, so that building a complete AV solution became possible.

After the promotional efforts I made in the first months of 1988 to make the industry aware of the video coding work, I undertook a similar effort to inform the industries that MPEG was going to provide a complete audio-visual solution. In this effort I contacted Prof. Hans-Georg Musmann, director of the Information Processing Institute at the Technical University of Hannover. Hans was well known to me because he had been part of the Steering Committee of the “Workshop on 64 kbit/s coding of moving video”, an initiative that I had started in 1988 to promote the progress of low bitrate video coding research, and he had actually hosted the first two workshops. Because of his institute’s and his personal standing, Hans was playing a major role in the Eureka project 147 Digital Audio Broadcasting (DAB).

The last meeting of 1988 was held at Hannover. The first two days (29 and 30 November) were dedicated to video matters and held at the old Telefunken labs (those that had developed the PAL system). Part of the meeting was devoted to viewing and selecting video test sequences to be used for simulation work and quality tests. The CCIR library of video sequences had been kindly made available through the good offices of Ken Davies, then with the Canadian Broadcasting Corporation, an acquaintance from the HDTV workshop. Two of the video sequences – “Table Tennis” and “Flower Garden” – selected on that occasion would be used and watched by thousands of people engaged in video coding research both inside and outside of MPEG. Another output of that meeting was the realisation that the MPEG standard, to be fully exploitable for interactive applications on CD-Read Only Memory (CD-ROM), should also be capable of integrating “multimedia” components. Therefore I undertook to see how this request could be fulfilled. 

The last two days (1 and 2 December) saw the kick-off of the audio work, with the participation of some 30 experts at Hans’s Institute. Gathering so many audio coding experts had been quite an achievement because, unlike video and speech coding, for which there were well-established communities developing technologies with a long tradition in standardisation – myself being one element of it – audio coding was a field where the number of researchers was more limited and scattered across a smaller number of places, such as the research establishments of ATT, CCETT, IRT, Matsushita, Philips, Sony, Thomson and a few others. The Hannover meeting gave the attending researchers the opportunity to listen, in some cases for the first time, to the audio coding results of their peers. So the first MPEG subgroup – Audio – was born and Prof. Musmann was appointed as its chairman. The meeting also produced a document, intended for wide external distribution, which invited interested parties to pre-register their intention to submit proposals for video and audio coding algorithms when MPEG would issue a Call for Proposals (CfP).

Bellcore, a research organisation spun off from Bell Labs after the break-up of ATT, hosted the February 1989 meeting at their facilities in Livingston, NJ. The main task of the meeting was to develop the first version of the so-called Proposal Package Description (PPD), i.e. a document describing all the elements that proposers of algorithms had to submit in order to have their proposals considered. The document also contained the first ideas concerning the testing of proposals, both subjective and objective.

That meeting was also memorable (to me) for the attendance of Mr. Roland Zavada of Kodak. Rollie, the chairman of a high-level ISO group coordinating image-related matters, had come to inspect this unheard-of group of experts dealing with Moving Pictures – which he had clearly taken to mean Motion Pictures – with a membership growing at every meeting like mushrooms.

Livingston was followed by Rennes in May and Stockholm in July 1989. The latter meeting produced a new version of the PPD in which the video part was final and was incorporated in the CfP. This contained operational data for carrying out the subjective tests, but also data to assess VLSI implementability and to weigh the importance of different features. Similar data were also beginning to populate the part concerning the audio tests. For the systems aspects the document was still at the rather preliminary level of requirements.

At the Stockholm meeting the second MPEG subgroup – Video – was established and Didier Le Gall, then with Bellcore, was appointed as its chairman. This subgroup was established as a formalisation of the most prominent of the ad hoc groups that had already been working, meeting and reporting in the areas of Video, Tests, Systems, VLSI implementation complexity and Digital Storage Media (DSM).


MPEG-1 Development – Video

The Kurihama meeting in October 1989 was a watershed in many senses. Fifteen video coding proposals were received, including one from the COMIS project. They contained D1 tapes with sequences encoded at 900 kbit/s, the description of the algorithm used, an assessment of complexity and other data. The site had been selected because JVC had been so kind as to offer their outstanding facilities to perform the subjective tests, with MPEG experts acting as testing subjects. At the end of the meeting a pretty rough idea of the features of the algorithm could be obtained, and plans were made to continue work “by correspondence”, as this kind of work was called in those pre-internet days.

About 100 delegates attended the Kurihama meeting. With the group reaching such a size, it became necessary to put in place a formal structure. New subgroups were established and chairmen appointed: Tsuneyoshi Hidaka (JVC), who had organised the subjective tests, led the Test group, Allen Simon (Intel) led the Systems group, Colin Smith (Inmos, later acquired by ST Microelectronics) led the VLSI group and Takuyo Kogure (Matsushita) led the Digital Storage Media (DSM) group. These were in addition to the already established Audio and Video groups chaired by Hans-Georg Musmann (University of Hannover) and Didier Le Gall (Bellcore), respectively. 

In this way, the main pillars of the MPEG organisation were established: the Video group and the Audio group in charge of developing the specific compression algorithms starting from the most promising elements contained in the submissions, the Systems group in charge of developing the infrastructure that held the compressed audio and video information together and made it usable by applications, the Test group assessing video quality (the Audio group took care directly of organising its own tests), the VLSI group assessing the implementation complexity of the compression algorithms and the DSM group studying the – at that time only – application environment (storage) of MPEG standards. In its 25 years, the internal organisation of MPEG has undergone gradual changes, and there has been quite a turnover of chairs. A summary table detailing the names of groups and chairs over the 32 years of MPEG history is here.

The different subgroups had gradually become the places where heated technical discussions were the norm, while the MPEG plenary meeting had become the place where the entire work done by the groups was reviewed for the benefit of those who had not had the opportunity to attend other groups’ meetings but still wanted to be informed – and possibly even retain the ability to have a say in other groups’ conclusions – where unsettled matters were resolved and where a formal seal of approval was given to all decisions. There were, however, other matters of general interest that also required discussion, but it was no longer practical to have such discussions in the plenary. As a way out, I started convening representatives of the different national delegations in separate meetings at night. This was the beginning of the Heads of Delegation (HoD) group. This name lasted for a quarter of a century until one day someone in ISO discovered that there are no delegations in working groups. From that moment the HoDs were called Convenor Advisors and everything went on as before.

It was during an HoD meeting that the structure of the MPEG standard was discussed. One possible approach was to have a single standard containing everything, the other to split the standard into parts. The former was attractive, but it would have led to a standard of monumental proportions. Eventually the approach adopted was, as John Morris of Philips Research, the UK HoD at that time, put it, to make the standard “one and trine”, i.e. a standard in three parts: Systems, Video and Audio. The Systems part would deal with the infrastructure holding together the compressed audio and video (thereby making sure that later implementors would not find holes in the standard), the Video part would deal with video compression, and the Audio part would deal with audio compression.

From a non-technical viewpoint, but quite important for a fraction of the participants, the meeting was also remarkable because, during the lunch break on the first day, news appeared on television that a major earthquake had hit San Francisco, and most of the participants from the Bay Area had to leave in haste. It later became known that fortunately no one connected to the Kurihama meeting participants had been seriously affected, but the work on VLSI implementation complexity clearly suffered, as many of the participants in that activity had come from the Bay Area.

At Kurihama there was also a meeting of SC 2/WG 8. A noteworthy event was the establishment of the Multimedia and Hypermedia Experts Group (MHEG), a group parallel to JBIG, JPEG and MPEG. This was the outcome of the undertaking I had made at the Hannover meeting one year before to look into the problem of a general multimedia standard. After that meeting I had contacted Francis Kretz of CCETT, who had been spearheading an activity on the subject in France, and invited him to come to Kurihama. At that meeting Francis was appointed as chair of MHEG.

The Kurihama tests had given a clear indication that the best-performing and most promising video coding algorithm was one that encoded pictures predictively, starting from a motion-compensated previous picture and using the Discrete Cosine Transform (DCT), à la CCITT H.261. This meant that the new standard could easily support one of the requirements, “compatibility with H.261”, a request made by MPEG telco members. On the other hand, the design of the standard could enjoy more flexibility in coding tools, because the target application was “storage and retrieval on DSM” and not real-time communication, where transmission of information with minimum delay is at a premium. This is why MPEG-1 Video (with a numbering system that was adopted in all subsequent MPEG standards) has interpolation between coded pictures as a tool that an encoder can use.
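To make the idea of motion-compensated prediction concrete, here is a toy C sketch (the frame contents, sizes and the motion vector are invented; a real encoder works on 16×16 macroblocks and searches for the best vector): the encoder transmits not the picture itself but the difference between each pixel and a displaced pixel of the previous picture, and it is this residual that the DCT then compresses.

    #include <stdio.h>

    #define W 8   /* toy frame width  */
    #define H 8   /* toy frame height */

    /* Energy of the prediction residual when the current picture is
     * predicted from the previous picture displaced by (dx, dy). */
    static int residual_energy(unsigned char prev[H][W],
                               unsigned char cur[H][W], int dx, int dy)
    {
        int energy = 0;
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                int px = x + dx, py = y + dy;
                int pred = (px >= 0 && px < W && py >= 0 && py < H)
                           ? prev[py][px] : 0;
                int diff = cur[y][x] - pred;
                energy += diff * diff;
            }
        return energy;
    }

    int main(void)
    {
        unsigned char prev[H][W], cur[H][W];
        /* The current picture is the previous one shifted one pixel right. */
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                prev[y][x] = (unsigned char)(16 * x);
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                cur[y][x] = prev[y][x > 0 ? x - 1 : 0];

        printf("residual without compensation : %d\n",
               residual_energy(prev, cur, 0, 0));
        printf("residual with vector (-1, 0)  : %d\n",
               residual_energy(prev, cur, -1, 0));   /* zero: perfect match */
        return 0;
    }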

Philips hosted the following meeting in Eindhoven at the end of January 1990. The most important result was the drafting of the Reference Model (RM) version zero (RM0), i.e. a general description of the algorithm to be used as a test bed for carrying out experiments. The Okubo group had used a similar document with the same name for the development of H.261, but MPEG formalised the process of Core Experiments (CE) as a practical means to improve the RM in a collaborative fashion. A CE was defined as a particular instance of the RM at a given stage that allowed optimisation tests to be performed on one feature while keeping all other options in the RM fixed. At least two companies had to provide comparably positive results for the technology tested in a CE to qualify for promotion into the standard. This method of developing standards based on CEs has remained a constant.

RM0 was largely based on H.261. This, and a similar decision made in 1996 to base MPEG-4 Video on ITU-T H.263, is cited by some MPEG critics as proof that MPEG does not innovate. Those making this remark are actually providing the answer to their own objection, because innovation is not an abstract good in itself. The timely provision of good solutions that enable interoperable – as opposed to proprietary – products and services is the added value that MPEG offers to its constituency: companies large and small. People would be right to see symptoms of misplaced self-assertion if MPEG were to choose something different from what is known to do a good job just for the sake of it. On the other hand, I do claim that MPEG does a lot of innovation, but at the level of transforming research results into practically implementable audio-visual communication standards, as the thousands of researchers who have worked in MPEG over the years can testify.

The Eindhoven meeting did not adopt the Common Intermediate Format (CIF) and in its stead introduced the Source Input Format (SIF). Unlike CIF, SIF is not yet another video format. It is two formats in one, but not “new” formats, because both are obtained by subsampling two existing formats: 625 lines @25 frames/s and 525 lines @29.97 frames/s. The former has 352×288 pixels @25 Hz and the latter 352×240 pixels @29.97 Hz. CE results could be shown, and considered valid, irrespective of which of the two formats was used.
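As a back-of-the-envelope sketch of where these numbers may come from (assuming the usual CCIR 601 active picture of 720×576 at 25 Hz and 720×480 at 29.97 Hz): keeping one field out of two halves the number of lines, halving the horizontal resolution gives 360 pixels per line, and trimming 360 down to 352 makes the width an exact multiple of 16-pixel macroblocks.

    #include <stdio.h>

    /* Derivation of the two SIF picture sizes from the CCIR 601 active
     * picture (an assumption made here for illustration): take one field
     * out of two, halve the horizontal resolution and trim the width to a
     * multiple of 16. */
    int main(void)
    {
        const int src_w = 720;
        const int src_h[2] = { 576, 480 };       /* 625- and 525-line systems */
        for (int i = 0; i < 2; i++) {
            int w = (src_w / 2) / 16 * 16;       /* 360 trimmed to 352 */
            int h = src_h[i] / 2;                /* 288 or 240         */
            printf("SIF from %d active lines: %dx%d\n", src_h[i], w, h);
        }
        return 0;
    }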

At the end of the second and last day of the meeting, a hurricane of proportions not seen before swept across the whole of the Netherlands. Trees were uprooted, roads were blocked and trains stopped midway between stations: proof that the Forces of Nature were once again showing their concern for the work of MPEG.

The development of MPEG-1 Video took two full years in total starting from the Kurihama tests and involved the participation of hundreds of experts, some attending the meetings and many more working in laboratories and providing results to be considered at the next meeting. As a result, an incredible wealth of technical inputs was provided that allowed the development of an optimised algorithm. 

A major role in this effort was played by the VLSI group, chaired by Geoffrey Morrison of BT Labs, who had replaced Colin Smith, the initiator of the group. The VLSI group provided the neutral place where the impact of different proposals – both video and audio – on implementation complexity was assessed. MPEG-1 is a standard optimised for the VLSI technology of those years, because real-time video and audio decoding – never mind encoding – was only possible with integrated circuits. Even though the attention at that time concentrated on VLSI implementation complexity, the subgroup already considered software implementation complexity as part of its mandate.

At the following meeting in Tampa, FL, hosted by IBM, the name Reference Model was abandoned in favour of Simulation Model (SM). This tradition of sequentially changing the name of the Model at each new standard has continued: the names Test Model (TM), Verification Model (VM), eXperimentation Model (XM) and sYstem Model (YM) have been used for each of the MPEG-2, MPEG-4, MPEG-7 and MPEG-21 standards, respectively. 

Still in the area of software, an innovative proposal was made at the Santa Clara, CA meeting in September 1990 by Milton Anderson, then with Bellcore. The proposal amounted to using a slightly modified version of the C programming language to describe the more algorithmic parts of the standard. The proposal was accepted, and this marked the first time that a (pseudo-) computer programming language had been used in a standard to complement the text. The practice has since spread to virtually all environments doing work in this and similar areas and has actually been extended to describing the entire standard in software, as we will see later.
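To give an idea of the style – this is my own illustration, not text taken from the MPEG-1 standard – the algorithmic parts read like ordinary C routines operating on the bitstream, such as a function that reads n bits most-significant-bit first:

    #include <stdint.h>
    #include <stdio.h>

    /* Read n bits, MSB first, from a byte buffer starting at bit position
     * *pos, advancing the position. Illustrative only. */
    static unsigned read_bits(const uint8_t *buf, unsigned *pos, unsigned n)
    {
        unsigned value = 0;
        while (n--) {
            unsigned bit = (buf[*pos >> 3] >> (7 - (*pos & 7))) & 1;
            value = (value << 1) | bit;
            (*pos)++;
        }
        return value;
    }

    int main(void)
    {
        const uint8_t stream[] = { 0xB3, 0x01 };   /* arbitrary test bytes */
        unsigned pos = 0;
        printf("first 4 bits: %u\n", read_bits(stream, &pos, 4));  /* 11 */
        printf("next 8 bits : %u\n", read_bits(stream, &pos, 8));  /* 48 */
        return 0;
    }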


MPEG-1 Development – Audio

Work in the Audio group was also progressing. Many participants were people interested in audio-only applications, some of them working in the Eureka 147 DAB project. For the majority of them it was important to develop a standard that would provide compressed audio with CD quality at the bitrate of 256 kbit/s, because that bitrate was suitable for digital audio broadcasting. This target affected the video work: video simulation results had to be shown at 1.15 Mbit/s, because this was what remained of the total CD payload of about 1.4 Mbit/s once some 256 kbit/s had been set aside for audio.

The approach of the Audio group in the development of the standard was somewhat different from the one followed by the Video group. Instead of producing a general CfP, the Audio group first worked to cluster the proposals that the different companies were considering.

These were the four clusters: 

  1. Transform Coding with overlapping blocks 
  2. Transform Coding with non-overlapping blocks 
  3. Subband Coding with less than or equal to 8 subbands 
  4. Subband Coding with more than 8 subbands. 

Each cluster was encouraged to provide a single proposal, and this indeed happened. Swedish Radio (SR) was kind enough to perform the subjective tests of the four clustered proposals using “golden ears”, i.e. specialists capable of detecting the slightest imperfection in a sound. The results of the subjective tests were shown in Stockholm in June 1990 (formally a part of the Porto meeting held in July, where the rest of MPEG was meeting). The reason for having this session in Stockholm was to be able to listen to the submissions in the same setup the golden ears had used for the tests.

The first clustered proposal performed best in terms of subjective quality. Not unexpectedly, however, its implementation cost was higher than that of the fourth clustered proposal, which scored lower but came with a lower implementation complexity. This was an undoubted challenge that the Audio chairman resolved with time and patience. This was also the last achievement of Hans Musmann, who left MPEG at the Paris meeting in May 1991. His place was taken over by Prof. Peter Noll of the Technical University of Berlin.

The result of the work was an audio coding standard that, unlike the corresponding video standard, was not monolithic because there were three different “flavours”: the first – called Layer I – was based on subband coding and had low complexity but the lowest performance, the second – called Layer II – was again based on subband coding with average complexity and good performance and the third – called Layer III – was based on transform coding and provided the best performance, but at a considerably higher implementation cost. So much so that, at that time, many considered Layer III as impractical for a mass-market product. Therefore there could be 3 different conforming implementations of the MPEG-1 Audio standard, one for each layer. The condition was imposed, however, that a standard MPEG-1 Audio decoder of a “higher” layer had to be able to decode all the “lower” layers. 

The verification tests carried out before releasing the MPEG-1 Audio standard showed that subjective transparency, defined as a rating of the encoded stereo signal greater than 4.6 on the 5-point CCIR quality scale as assessed by “golden ears”, was achieved at 384 kbit/s for Layer I, 256 kbit/s for Layer II and 192 kbit/s for Layer III. The promise to achieve “CD quality” at 256 kbit/s with compressed audio had been met and surpassed. Later, with continuous improvements in encoding (which is not part of the standard), even better results could be achieved.


MPEG-1 Development – Systems

The development of the Systems part of the standard was done using yet another methodology. The Systems group, a most diversified collection of engineers from multiple industries, after determining the requirements the Systems layer had to satisfy, decided that they did not need a CfP, because the requirements were so specific that they felt they could simply design the standard by themselves in a collaborative fashion. The initial impetus was provided by Juan Piñeda, then with Apple Computer, at the Porto meeting in July 1990, when he proposed the first packet-based multiplexer. Eventually Sandy MacInnis of IBM became the chairman of that group after Allen Simon’s resignation. 

One of the issues the group had to deal with was “byte alignment”, a typical requirement from the computer world that the telco world, because of its “serial” approach to bitstreams, did not value. This can be seen, e.g., from the fact that the H.261 bitstream is not byte-aligned. Byte alignment was eventually supported in MPEG-1 Systems, because the systems decoding of a byte-aligned 150 kbyte/s stream was already feasible with the CPUs of that time. In the process, the MPEG-1 Video syntax, too, was made byte-aligned.
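A small C sketch of why byte alignment mattered to the computer people (buffer contents and field positions are invented): a byte-aligned 16-bit field can be read with two array accesses, whereas a field starting at an arbitrary bit offset needs shifting and masking across byte boundaries.

    #include <stdint.h>
    #include <stdio.h>

    /* A byte-aligned 16-bit field: two array accesses and a shift. */
    static uint16_t read_aligned(const uint8_t *buf)
    {
        return (uint16_t)((buf[0] << 8) | buf[1]);
    }

    /* The same field starting at an arbitrary bit offset (0..7): the bytes
     * must be gathered into a wider word, shifted and masked. */
    static uint16_t read_unaligned(const uint8_t *buf, unsigned bit_offset)
    {
        uint32_t window = ((uint32_t)buf[0] << 16) |
                          ((uint32_t)buf[1] << 8)  |
                           (uint32_t)buf[2];
        return (uint16_t)((window >> (8 - bit_offset)) & 0xFFFF);
    }

    int main(void)
    {
        const uint8_t buf[] = { 0x12, 0x34, 0x56 };
        printf("aligned field : 0x%04X\n", read_aligned(buf));       /* 0x1234 */
        printf("field at bit 4: 0x%04X\n", read_unaligned(buf, 4));  /* 0x2345 */
        return 0;
    }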

Another issue was the choice between constant and variable packet size. One could have thought that, because the main target of MPEG-1 was Digital Storage Media, whose disc formats have a fixed block size, a fixed packet length should have been selected. Eventually, however, this did not happen, a consequence of the fact that the physical format of the disc is, in OSI terminology, a layer 2 issue, while packet multiplexing is a higher-layer issue that did not necessarily have to be tied to the former. In conclusion, MPEG-1 Systems turned out to be a very robust and flexible specification, capable of supporting the transfer of tightly synchronised video and audio streams across an arbitrary error-free delivery system.

The second MPEG London meeting in November 1992 put the seal on MPEG-1, with the approval of the first three parts of the standard: Systems, Video and Audio. Since then the standard has been very stable: in spite of its complexity, very few corrigenda were published after that date – two for Systems, three for Video and one for Audio. The MPEG-1 work did not stop at that meeting, though, because work on conformance and reference software continued well into 1994.