Discrete Cosine Transform in Video Compression - Explain Like I'm 5

The Discrete Cosine Transform (DCT) is arguably the most fundamental tool in modern image and video compression. The DCT is used to convert data in the pixel domain to the frequency domain to reveal insights about the image or video frame.

The goal of this article is to give you an intuitive understanding of the DCT without all the crazy math 🙂

So, let’s go!

Table of Contents

What are Transforms (the T in DCT)?

Before we talk about transforms in signal processing, let’s get an intuitive understanding of the “why” of transforms. Why are transforms needed, and what is their role?

Imagine that you are looking at three spheres/balls through a window (as shown in the image below). Which of these spheres do you think is the biggest and which is the smallest?

discrete cosine transform — Front View of Three Spheres from a Window

Well, this appears easy. The sphere on the left appears to be the smallest and the one on the right is the biggest. Right?

Are You Sure, Though?

Now, If I show you an aerial view of the three spheres (using a drone), do you think it will change your mind? Not sure? Let’s take a look.

In this image, you are looking at

the three spheres from above, and
the thick blue line is the top view of the window (hence, it looks 2-dimensional)

The spheres all are the same size – but, they are at different distances from the window (house).

The left-most sphere is very far from the window, and hence it looks the smallest.
The sphere to the right is close to the window, and hence it looks the biggest!.

Fascinating, right?

Your View of the Data was Transformed

Let’s take a minute to understand what we just did and why our answer changed.

We took a piece of data and developed two views or perspectives of the data by changing our physical position. That is, we looked at the data

from the front, and
from above

And, these two views combined gave us a better understanding of the data and helped us give the right answer.

Okay, now lets leave this example aside and look at another one where changing your perspective gives you so much information about the data.

Starry Skies and Constellations

When you look into the starry night and locate a constellation, ask yourself this – are all the stars in that constellation on the same 2D plane? Or are they very far away from each other?

Here is a fantastic video that shows how “correlated” stars in a constellation are. You’ll see that the stars are very far away from each other, but, they appear to lie on the same plane, and in a particular shape because we are looking at the stars from Earth.

The distance between the stars became apparent when we transformed our “viewing position” to gain more information about the data, which is what a transform does.

Okay – So, What is a Transform?

A transform is a mathematical function that transforms (changes) the input data from one domain (view) to another. Applying or using a transform helps,

expose hidden characteristics of the data, or
gain a better (combined) understanding of the data, or
highlight or downplay specific characteristics of the data.

Examples of transforms include the

Discrete cosine transform
Discrete sine transform
Wavelet transforms
Fourier transforms
Laplace transforms
Z-transforms

Now, that you know what a transform does, let’s learn about the famous Discrete Cosine Transform.

Explain DCT Like I’m Five (ELIF)

After all the math and technical jargon, let’s try and explain the DCT to a 5-year-old (it’s hard but let’s try).

Imagine you are playing “I Spy With My Little Eye” with a small kid. It is a game where a person chooses an item in a room in his mind, and the others have to guess the item by asking questions (like 20 questions). For the sake of this example, let’s tweak the rules and provide the 20 clues ourselves. Bear with me 🙂

Now, let’s assume that I think of a painting of a boat hanging on the wall and ask a 5-year-old to guess what I am thinking of.

What do you think is the best clue that we can provide? (or rather ask).

How about “the item is hanging on the wall opposite the door, and below the doorbell.” Such a detailed clue is guaranteed to lead the kid close to the prize, right?

Or, we can use a combination of two clues –

First clue: “the item is square or box-shaped.”
Second clue: “it has an ocean and a boat in it.”.

If you do this carefully, you don’t need 20 clues to guide the kid to the painting – in all likelihood, 5 – 8 clues would suffice. Some of the clues can lead the kid to the answer faster than the other clues. This is an indication of the amount of information those clues hold within them.

So, what you’ve done is,

taken some data as input (the location of a painting),
converted it into a set of 20 clues (output),
arranged the clues in order of importance (how much information does the clue contain),
and shown that just a few of the clues can capture the painting’s location while the rest of them add details to the puzzle.

The Discrete Cosine Transform

takes data in one form (pixel domain)
converts it to another form such that the output data is arranged in decreasing order of importance. (transform domain)
This allows us to use only a few of the output data points and still get back to the original data.

I hope you understood what the DCT is trying to achieve (in essence).

Because the kid gloves have to come off now, and the ELI5 ends here 🙂 The remaining sections require high school math and some understanding of programming – preferably in MATLAB or Octave.

Introducing the Discrete Cosine Transform

The Discrete Cosine Transform or DCT is a widely used transform for image and video compression. The definition as per Wikipedia is as follows:-

The Discrete Cosine Transform expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.

Still with me? Don’t worry about the math behind the DCT for now.

In simpler terms, the Discrete Cosine Transform takes a set of N correlated (similar) data-points and returns N de-correlated (dis-similar) data-points (coefficients) in such a way that the energy is compacted in only a few of the coefficients M where M << N.

If that doesn’t make sense, try and remember this – the DCT takes input data and converts it to another domain while doing two very important things –

it de-correlates the data (removes any relation or similarity between the data points)
it compacts the energy/information into only a few of the output data points

To summarize, the DCT takes

N data-points as its input
returns N data-points as its output
and ensures that most of the input information is concentrated in only a few of the N output data-points.

This is called the energy-compaction (or information compaction) property of the DCT.

Let’s understand this with an example using MATLAB.

Take an 8x8 matrix filled with the number 255 as shown below. If you need 8 bits to store each number, you need 8 x 64 = 512 bits to store the entire matrix. Good so far?

Okay, now, let’s apply an 8x8 2D-DCT to this matrix and get 8x8 DCT coefficients. The 2D-DCT is the Two-Dimensional form of the DCT used on 2D data such as grayscale images.

The output of the 2D-DCT operation on the 8x8 matrix is shown below.

The output looks different, doesn’t it?

If you notice carefully, the first coefficient ([0, 0]) element of the matrix is non-zero and the rest of the elements are zero. This greatly reduces the storage space needed for this matrix.

This is due to the de-correlating and energy compaction property of the DCT. This is usually described as follows in technical literature and can be a little difficult to understand –

the DCT has compacted the matrix’s energy into the first element, referred to as the DC coefficient. The rest of the coefficients are called the AC coefficients

This means that –

the top-left corner of the output 2D DCT is called the DC coefficient. It is the most important output of the DCT and contains a lot of information about the original image.
the rest of the coefficients are called the AC coefficients. If you are transforming an image using the DCT, the AC coefficients contain the image’s finer details.

Now, if I take these DCT coefficients and apply an inverse-2D-DCT to it, I will get back the original coefficients.

If you want to try it out, here are the MATLAB commands to replicate the above experiment.

inputPixels = ones(8,8) * 255;
dctCoeffs = dct2(inputPixels);
reconstructedPixels = idct2(dctCoeffs);

Application of Discrete Cosine Transform to Image & Video Compression

The decorrelation and energy compaction properties of the DCT make it extremely attractive to image and video compression. The Karhunen–Loève transform (KLT) (often referred to as the ideal transform) has much better decorrelating properties but is computationally intractable. The DCT on the other hand, is easy to program, and has captured the image and video compression world, and quite closely approximates the output of the KLT.

Here is a simple MATLAB script that demonstrates the power of the 2D-DCT when applied to image compression (and, in extension to video compression).

% read an image (MATLAB provides a few sample images)
RGB = imread('autumn.tif');

% convert to grayscale
I = rgb2gray(RGB);

% compute the 2D DCT
J = dct2(I);

% discard certain coefficients (set to zero)
J(abs(J) < T) = 0;

% recover the pixels using the inverse 2D DCT
K = idct2(J);

% matlab code to display the original and reconstructed image
figure
imshowpair(I,K,'montage')
title('Original Grayscale Image (Left) and Reconstructed Image (Right)');

The code is straightforward –

load an image (RGB) and convert it to gray-scale
compute the 2D-DCT and store it in J
set all the coefficients whose magnitude is less than a threshold T to 0
compute the inverse 2D-DCT and recover the pixels (reconstructed)
compare the two images – original and reconstructed.

Now, let’s run two experiments.

Experiment 1: Let’s set the threshold to 50 and set all the AC coefficients whose magnitude is less than 50 to 0. And, then reconstruct the image using the inverse-DCT. Remember, we do not touch the DC coefficient in this example (whose magnitude is much, much greater than 50).

If you take a look at the image below, you can see that it is blurred and doesn’t have all the features. But, here is the amazing part – we set a large number of coefficients to zero, we retained only 3.45% of the total 71070 coefficients. So, with just 3.45% of the coefficients, we could reconstruct an “okay” looking image – quite blurred, but you could recognize the image.

Experiment 2: Now, let’s change the threshold to 10 and set all coefficients whose magnitude is less than 10 to 0. This time, the number of non-zero coefficients i.e., the coefficients that we have retained is 23.45% of the total. Here is the reconstructed image (after inverse 2D-DCT) – looks amazing right??

Why did this happen in both situations? Here are some reasons –

It is because we protected the DC coefficient and did not discard it.
In the first case, we threw away a lot of AC coefficients and that affected the finer details of the image as seen by the blurriness.
In the second case, we preserved many more AC coefficients and helped retain some of the finer details.

Now, if you reduce the threshold even further, you can improve the quality of the image.

Ultimately, these examples show that the DCT can be used to drastically compress data and recover it (in a lossy manner) even by discarding more than 50% of the output of the DCT.

The Takeaway From These Experiments

These two examples show the Discrete Cosine Transform’s power and its twin-properties of decorrelation and energy compaction. They demonstrate that you can discard a large percentage of the DCT coefficients and still manage to reconstruct the image with reasonable quality.

Conclusion

I hope you gained an intuitive understanding of the Discrete Cosine Transform and how its properties help in image and video compression. I didn’t delve into the deep mathematical details, but there is a ton of literature online if you want to dig deeper.

PS: This article trended in the top 3 at HackerNews for the better part of a day, got 8000+ views, and I received positive feedback along with constructive criticism that has led to revision of some sections and further simplifying the explanation. A sincere “thank you” to all my anon reviewers 🙂

Krishna Rao Vijayanagar

Founder at OTTVerse

Krishna Rao Vijayanagar, Ph.D., is the Editor-in-Chief of OTTVerse, a news portal covering tech and business news in the OTT industry.

With extensive experience in video encoding, streaming, analytics, monetization, end-to-end streaming, and more, Krishna has held multiple leadership roles in R&D, Engineering, and Product at companies such as Harmonic Inc., MediaMelon, and Airtel Digital. Krishna has published numerous articles and research papers and speaks at industry events to share his insights and perspectives on the fundamentals and the future of OTT streaming.

Discrete Cosine Transform in Video Compression – Explain Like I’m 5