Discrete Cosine Transform in Video Compression – Explain Like I’m 5

In this article of the “Hitchhiker’s Guide to Video Compression” series, let’s understand what Transforms are with a simple and intuitive example. And, then let’s go on to understand the role of the Discrete Cosine Transform (DCT) in Image and Video compression.👋 This is part of a series of articles titled “The Hitchhiker’s Guide to Video Compression” – a gentle and opinionated introduction to the fascinating world of video compression.

🔔 Subscribe to get notified of “The Hitchhiker’s Guide to Video Compression” articles directly in your inbox, or, follow via Twitter or, LinkedIn.

Let’s Start With A Simple Exercise

Before we take on the profoundly mathematical subject of transforms in signal processing, let’s first get an intuitive understanding of the “why” of transforms. Why are they needed, and what is their role?

Disclaimer: video engineer here and not a photoshop expert 🙂

Take a look at the image below and imagine that you are looking at three spheres/balls through your window. If I ask you which sphere is the biggest, you’ll be able to tell me, right?

discrete cosine transform
Front View of Three Spheres Through a Window

Well, this appears easy. The sphere on the left appears to be the smallest and the one on the right is the biggest. Right?

Are You Sure, Though?

Now, If I show you an aerial view of the three spheres (taken using a drone!), do you think it will change your mind? Not sure? Let’s take a look.

discrete cosine transform
Top View of The Three Spheres

In this image, you are looking at

  • the three spheres from above, and
  • the thick blue line is the top view of the window (hence, it looks 2-dimensional)

The spheres appear to be the same size as each other. Except, they are at different distances from the window (house).

The left-most sphere is very far from the window, and hence it looks the smallest. The sphere to the right is close to the window, and hence it looks the biggest!

Now, do you want to change your answer?

Your View of the Data was Transformed

Let’s take a minute to understand what we just did.

We took a piece of data and developed two views of the data by changing our physical position. That is, we looked at the data

  • from front and
  • from above

And, these two “different” views combined gave us a much better understanding of the data!

Now, let’s take a look at another example of a transform.

Starry Skies and Constellations

When you look into the starry night and locate a constellation, ask yourself this – are all the stars in that constellation on the same plane? Or are they very far away from each other?

Here is a fantastic video that shows how “correlated” stars in a constellation are. You’ll see that the stars are very far away from each other, but, they appear to lie on the same plane, and in a particular shape because we are looking at the stars from Earth.

The distance between the stars became apparent when we transformed our “viewing position” that allowed us to gain more information about the data and this is what a transform does.

A transform is a mathematical function that transforms (changes) the input data from one domain (view) to another to,

  • expose hidden characteristics of the data, or
  • gain a better (combined) understanding of the data, or
  • highlight or downplay specific characteristics of the data.

I spent time explaining “transforms” because this is typically the blocking point for most people trying to understand the discrete cosine transform or, for that matter, any transform (Fourier, Z, Laplace, etc.)

With this understanding, let’s learn about the famous Discrete Cosine Transform.

Explain DCT Like I’m Five

After all the math and technical jargon, let’s try and explain the DCT to a 5-year-old (it’s hard but let’s try).

Imagine you are playing “I Spy With My Little Eye” with a small kid. It is a game where a person chooses an item in a room in his mind, and the rest have to guess it by asking questions (like 20 questions).

Now, let’s assume that I think of a painting of a boat hanging on the wall and ask a 5-year-old to ask questions and guess what I am thinking of.

For the sake of this example, let’s tweak the rules and provide the 20 clues ourselves.

The best clue that we can provide is something like this “the item is hanging on the wall opposite the door”. This is guaranteed to lead the kid close to the prize, right?

The next best clue would be something like “the item is a square or box-shaped”. After that, we could say something like “it has an ocean and a boat in it.”.

If you do this carefully, you don’t need 20 clues to guide the kid to the painting – in all likelihood, 5 – 8 clues would suffice. Some of the clues can lead the kid to the answer faster than the other clues – this is an indication of the amount of information those clues hold within them.

Time to pause and consider what we have done so far.

So, what you’ve done is,

  • taken some data as input (the location of a painting),
  • converted it into a set of 20 clues (output),
  • arranged the clues in order of importance (how much information does the clue contain),
  • and shown that just a few of the clues can capture the painting’s location while the rest of them add details to the puzzle.

The Discrete Cosine Transform does something similar.

The Discrete Cosine Transform

  • takes data in one form,
  • converts it to another form such that the output data is arranged in decreasing order of importance.
  • This allows us to use only a few of the output data points and still get back to the original data.

I hope you understood what the DCT is trying to achieve (in essence).

Because the kid gloves have to come off now, and the ELI5 ends here 🙂 The remaining sections require high school math and some understanding of programming – preferably in MATLAB or Octave.

Introducing the Discrete Cosine Transform

The Discrete Cosine Transform or DCT is a widely used transform for image and video compression. The definition as per Wikipedia is as follows:-

The Discrete Cosine Transform expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.

Still with me? Don’t worry about the math behind the DCT for now.

In simpler terms, the Discrete Cosine Transform takes a set of N correlated (similar) data-points and returns N de-correlated (dis-similar) data-points (coefficients) in such a way that the energy is compacted in only a few of the coefficients M where M << N.

If that doesn’t make sense, try and remember this – the DCT takes input data and converts it to another domain while doing two very important things –

  1. it de-correlates the data (removes any relation or similarity between the data points)
  2. it compacts the energy/information into only a few of the output data points

To summarize, the DCT takes

  • N data-points as its input
  • returns N data-points as its output
  • and ensures that most of the input information is concentrated in only a few of the N output data-points.

This is called the energy-compaction (or information compaction) property of the DCT.

Let’s understand this with an example using MATLAB.

Take an 8x8 matrix filled with the number 255 as shown below. If you need 8 bits to store each number, you need 8 x 64 = 512 bits to store the entire matrix. Good so far?

discrete cosine transform

Okay, now, let’s apply an 8x8 2D-DCT to this matrix and get 8x8 DCT coefficients. The 2D-DCT is the Two-Dimensional form of the DCT used on 2D data such as grayscale images.

The output of the 2D-DCT operation on the 8x8 matrix is shown below.

discrete cosine transform

The output looks different, doesn’t it?

If you notice carefully, the first coefficient ([0, 0]) element of the matrix is non-zero and the rest of the elements are zero. This greatly reduces the storage space needed for this matrix.

This is due to the de-correlating and energy compaction property of the DCT. This is usually described as follows in technical literature and can be a little difficult to understand –

the DCT has compacted the matrix’s energy into the first element, referred to as the DC coefficient. The rest of the coefficients are called the AC coefficients

This means that –

  • the top-left corner of the output 2D DCT is called the DC coefficient. It is the most important output of the DCT and contains a lot of information about the original image.
  • the rest of the coefficients are called the AC coefficients. If you are transforming an image using the DCT, the AC coefficients contain the image’s finer details.

Now, if I take these DCT coefficients and apply an inverse-2D-DCT to it, I will get back the original coefficients.

If you want to try it out, here are the MATLAB commands to replicate the above experiment.

inputPixels = ones(8,8) * 255;
dctCoeffs = dct2(inputPixels);
reconstructedPixels = idct2(dctCoeffs);

Application of DCT to Image & Video Compression

The decorrelation and energy compaction properties of the DCT make it extremely attractive to image and video compression. The Karhunen–Loève transform (KLT) often referred to as the ideal transform has much better decorrelating properties but is computationally intractable. The DCT on the other hand, is easy to program and has captured the image and video compression world.

Here is a simple MATLAB script that demonstrates the power of the 2D-DCT when applied to image compression (and, in extension to video compression).

% read an image (MATLAB provides a few sample images)
RGB = imread('autumn.tif');

% convert to grayscale
I = rgb2gray(RGB);

% compute the 2D DCT
J = dct2(I);

% discard certain coefficients (set to zero)
J(abs(J) < T) = 0;

% recover the pixels using the inverse 2D DCT
K = idct2(J);

% matlab code to display the original and reconstructed image
figure
imshowpair(I,K,'montage')
title('Original Grayscale Image (Left) and Reconstructed Image (Right)');

The code is straightforward –

  • load an image (RGB) and convert it to gray-scale
  • compute the 2D-DCT and store it in J
  • set all the coefficients whose magnitude is less than a threshold T to 0
  • compute the inverse 2D-DCT and recover the pixels (reconstructed)
  • compare the two images – original and reconstructed.

Now, let’s run two experiments.

Experiment 1: Let’s set the threshold to 50 and set all the AC coefficients whose magnitude is less than 50 to 0. And, then reconstruct the image using the inverse-DCT. Remember, we do not touch the DC coefficient in this example (whose magnitude is much, much greater than 50).

If you take a look at the image below, you can see that it is blurred and doesn’t have all the features. But, here is the amazing part – we set a large number of coefficients to zero, we retained only 3.45% of the total 71070 coefficients. So, with just 3.45% of the coefficients, we could reconstruct an “okay” looking image – quite blurred, but you could recognize the image.

discrete cosine transform
Image reconstructed using only 3.45% of the DCT Coefficients

Experiment 2: Now, let’s change the threshold to 10 and set all coefficients whose magnitude is less than 10 to 0. This time, the number of non-zero coefficients i.e., the coefficients that we have retained is 23.45% of the total. Here is the reconstructed image (after inverse 2D-DCT) – looks amazing right??

discrete cosine transform
Image reconstructed using 23.45% of the DCT Coefficients

Why did this happen in both situations? Here are some reasons –

  • It is because we protected the DC coefficient and did not discard it.
  • In the first case, we threw away a lot of AC coefficients and that affected the finer details of the image as seen by the blurriness.
  • In the second case, we preserved many more AC coefficients and helped retain some of the finer details.

Now, if you reduce the threshold even further, you can improve the quality of the image.

Ultimately, these examples show that the DCT can be used to drastically compress data and recover it (in a lossy manner) even by discarding more than 50% of the output of the DCT.

The Takeaway From These Experiments

These two examples show the Discrete Cosine Transform’s power and its twin-properties of decorrelation and energy compaction. They demonstrate that you can discard a large percentage of the DCT coefficients and still manage to reconstruct the image with reasonable quality.

Conclusion

I hope you gained an intuitive understanding of the Discrete Cosine Transform and how its properties help in image and video compression. I didn’t delve into the deep mathematical details, but there is a ton of literature online if you want to dig deeper.


PS: This article trended in the top 3 at HackerNews for the better part of a day, got 8000+ views, and I received positive feedback along with constructive criticism that has led to revision of some sections and further simplifying the explanation. A sincere “thank you” to all my anon reviewers 🙂

About The Author

I’m Dr. Krishna Rao Vijayanagar, and I am the Founder and Editor of OTTVerse.com. I've spent several years working hands-on with Video Codecs (AVC, HEVC, MultiView Plus Depth), ABR streaming, and Video Analytics (QoE, Content & Audience, and Ad). I hope to use my experience and love for video streaming to bring you information and insights into the OTT universe. Please use the Contact Page to get in touch with me.

Leave a Reply