# Discrete Cosine Transform in Video Compression – Explain Like I’m 5

The Discrete Cosine Transform (DCT) is a fundamental tool in modern image and video compression. The DCT is used to convert data in the pixel domain to the frequency domain and this is done to reveal insights about the information contained in the pixels. In this article, we break down the DCT and try to explain it as simply as possible.

Let’s go!

Before we take on the profoundly mathematical subject of transforms in signal processing, let’s first get an intuitive understanding of the “why” of transforms. Why are they needed, and what is their role?

Disclaimer: video engineer here and not a photoshop expert 🙂

Take a look at the image below and imagine that you are looking at three spheres/balls through your window. If I ask you which sphere is the biggest, you’ll be able to tell me, right?

Well, this appears easy. The sphere on the left appears to be the smallest and the one on the right is the biggest. Right?

### Are You Sure, Though?

Now, If I show you an aerial view of the three spheres (taken using a drone!), do you think it will change your mind? Not sure? Let’s take a look.

In this image, you are looking at

• the three spheres from above, and
• the thick blue line is the top view of the window (hence, it looks 2-dimensional)

The spheres appear to be the same size as each other. Except, they are at different distances from the window (house).

• The left-most sphere is very far from the window, and hence it looks the smallest.
• The sphere to the right is close to the window, and hence it looks the biggest!

Now, do you want to change your answer to the original question?

## Your View of the Data was Transformed

Let’s take a minute to understand what we just did and why our answer changed.

We took a piece of data and developed two views or perspectives of the data by changing our physical position. That is, we looked at the data

• from front and
• from above

And, these two “different” views combined gave us a much better understanding of the data and compelled us to revist our position and ideas about the information we were presented with.

Okay, now lets leave this example aside and look at another one where changing your perspective gives you so much information about the data.

## Starry Skies and Constellations

When you look into the starry night and locate a constellation, ask yourself this – are all the stars in that constellation on the same plane? Or are they very far away from each other?

Here is a fantastic video that shows how “correlated” stars in a constellation are. You’ll see that the stars are very far away from each other, but, they appear to lie on the same plane, and in a particular shape because we are looking at the stars from Earth.

The distance between the stars became apparent when we transformed our “viewing position” to gain more information about the data, which is what a transform does.

So, what is a transform?

A transform is a mathematical function that transforms (changes) the input data from one domain (view) to another to,

• expose hidden characteristics of the data, or
• gain a better (combined) understanding of the data, or
• highlight or downplay specific characteristics of the data.

I spent time explaining “transforms” because this is typically the blocking point for most people trying to understand the discrete cosine transform or, for that matter, any transform (Fourier, Z, Laplace, etc.)

With this understanding of transforms in signal processing, let’s learn about the famous Discrete Cosine Transform.

## Explain DCT Like I’m Five

After all the math and technical jargon, let’s try and explain the DCT to a 5-year-old (it’s hard but let’s try).

Imagine you are playing “I Spy With My Little Eye” with a small kid. It is a game where a person chooses an item in a room in his mind, and the others have to guess the item chosen by asking questions (like 20 questions).

Now, let’s assume that I think of a painting of a boat hanging on the wall and ask a 5-year-old to ask questions and guess what I am thinking of.

For the sake of this example, let’s tweak the rules and provide the 20 clues ourselves. Bear with me 🙂

What do you think is the best clue that we can provide? Something like this “the item is hanging on the wall opposite the door, and below the doorbell.” Such a detailed clue is guaranteed to lead the kid close to the prize, right?

The next best clue would be something like “the item is a square or box-shaped”. After that, we could say something like “it has an ocean and a boat in it.”.

If you do this carefully, you don’t need 20 clues to guide the kid to the painting – in all likelihood, 5 – 8 clues would suffice. Some of the clues can lead the kid to the answer faster than the other clues – this is an indication of the amount of information those clues hold within them.

Time to pause and consider what we have done so far.

So, what you’ve done is,

• taken some data as input (the location of a painting),
• converted it into a set of 20 clues (output),
• arranged the clues in order of importance (how much information does the clue contain),
• and shown that just a few of the clues can capture the painting’s location while the rest of them add details to the puzzle.

The Discrete Cosine Transform does something similar.

The Discrete Cosine Transform

• takes data in one form,
• converts it to another form such that the output data is arranged in decreasing order of importance.
• This allows us to use only a few of the output data points and still get back to the original data.

I hope you understood what the DCT is trying to achieve (in essence).

Because the kid gloves have to come off now, and the ELI5 ends here 🙂 The remaining sections require high school math and some understanding of programming – preferably in MATLAB or Octave.

## Introducing the Discrete Cosine Transform

The Discrete Cosine Transform or DCT is a widely used transform for image and video compression. The definition as per Wikipedia is as follows:-

The Discrete Cosine Transform expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies.

Still with me? Don’t worry about the math behind the DCT for now.

In simpler terms, the Discrete Cosine Transform takes a set of `N` correlated (similar) data-points and returns `N` de-correlated (dis-similar) data-points (coefficients) in such a way that the energy is compacted in only a few of the coefficients `M` where `M << N`.

If that doesn’t make sense, try and remember this – the DCT takes input data and converts it to another domain while doing two very important things –

1. it de-correlates the data (removes any relation or similarity between the data points)
2. it compacts the energy/information into only a few of the output data points
Related:  Comprehensive Guide to LCEVC (MPEG-5 Part 2) - Low Complexity Enhancement Video Coding

To summarize, the DCT takes

• N data-points as its input
• returns N data-points as its output
• and ensures that most of the input information is concentrated in only a few of the N output data-points.

This is called the energy-compaction (or information compaction) property of the DCT.

Let’s understand this with an example using MATLAB.

Take an `8x8` matrix filled with the number `255` as shown below. If you need `8` bits to store each number, you need `8 x 64 = 512` bits to store the entire matrix. Good so far?

Okay, now, let’s apply an `8x8` 2D-DCT to this matrix and get `8x8` DCT coefficients. The 2D-DCT is the Two-Dimensional form of the DCT used on 2D data such as grayscale images.

The output of the 2D-DCT operation on the `8x8` matrix is shown below.

The output looks different, doesn’t it?

If you notice carefully, the first coefficient (`[0, 0]`) element of the matrix is non-zero and the rest of the elements are zero. This greatly reduces the storage space needed for this matrix.

This is due to the de-correlating and energy compaction property of the DCT. This is usually described as follows in technical literature and can be a little difficult to understand –

the DCT has compacted the matrix’s energy into the first element, referred to as the DC coefficient. The rest of the coefficients are called the AC coefficients

This means that –

• the top-left corner of the output 2D DCT is called the DC coefficient. It is the most important output of the DCT and contains a lot of information about the original image.
• the rest of the coefficients are called the AC coefficients. If you are transforming an image using the DCT, the AC coefficients contain the image’s finer details.

Now, if I take these DCT coefficients and apply an inverse-2D-DCT to it, I will get back the original coefficients.

If you want to try it out, here are the MATLAB commands to replicate the above experiment.

``````inputPixels = ones(8,8) * 255;
dctCoeffs = dct2(inputPixels);
reconstructedPixels = idct2(dctCoeffs);
``````

## Application of DCT to Image & Video Compression

The decorrelation and energy compaction properties of the DCT make it extremely attractive to image and video compression. The Karhunen–Loève transform (KLT) often referred to as the ideal transform has much better decorrelating properties but is computationally intractable. The DCT on the other hand, is easy to program and has captured the image and video compression world.

Here is a simple MATLAB script that demonstrates the power of the 2D-DCT when applied to image compression (and, in extension to video compression).

``````% read an image (MATLAB provides a few sample images)

% convert to grayscale
I = rgb2gray(RGB);

% compute the 2D DCT
J = dct2(I);

% discard certain coefficients (set to zero)
J(abs(J) < T) = 0;

% recover the pixels using the inverse 2D DCT
K = idct2(J);

% matlab code to display the original and reconstructed image
figure
imshowpair(I,K,'montage')
title('Original Grayscale Image (Left) and Reconstructed Image (Right)');
``````

The code is straightforward –

• load an image (RGB) and convert it to gray-scale
• compute the 2D-DCT and store it in `J`
• set all the coefficients whose magnitude is less than a threshold `T` to
• compute the inverse 2D-DCT and recover the pixels (reconstructed)
• compare the two images – original and reconstructed.

Now, let’s run two experiments.

Experiment 1: Let’s set the threshold to `50` and set all the AC coefficients whose magnitude is less than `50` to . And, then reconstruct the image using the inverse-DCT. Remember, we do not touch the DC coefficient in this example (whose magnitude is much, much greater than `50`).

If you take a look at the image below, you can see that it is blurred and doesn’t have all the features. But, here is the amazing part – we set a large number of coefficients to zero, we retained only 3.45% of the total `71070` coefficients. So, with just 3.45% of the coefficients, we could reconstruct an “okay” looking image – quite blurred, but you could recognize the image.

Experiment 2: Now, let’s change the threshold to `10` and set all coefficients whose magnitude is less than `10` to . This time, the number of non-zero coefficients i.e., the coefficients that we have retained is 23.45% of the total. Here is the reconstructed image (after inverse 2D-DCT) – looks amazing right??

Why did this happen in both situations? Here are some reasons –

• It is because we protected the DC coefficient and did not discard it.
• In the first case, we threw away a lot of AC coefficients and that affected the finer details of the image as seen by the blurriness.
• In the second case, we preserved many more AC coefficients and helped retain some of the finer details.

Now, if you reduce the threshold even further, you can improve the quality of the image.

Ultimately, these examples show that the DCT can be used to drastically compress data and recover it (in a lossy manner) even by discarding more than 50% of the output of the DCT.

### The Takeaway From These Experiments

These two examples show the Discrete Cosine Transform’s power and its twin-properties of decorrelation and energy compaction. They demonstrate that you can discard a large percentage of the DCT coefficients and still manage to reconstruct the image with reasonable quality.

## Conclusion

I hope you gained an intuitive understanding of the Discrete Cosine Transform and how its properties help in image and video compression. I didn’t delve into the deep mathematical details, but there is a ton of literature online if you want to dig deeper.

PS: This article trended in the top 3 at HackerNews for the better part of a day, got 8000+ views, and I received positive feedback along with constructive criticism that has led to revision of some sections and further simplifying the explanation. A sincere “thank you” to all my anon reviewers 🙂 ##### Krishna Rao Vijayanagar

I’m Dr. Krishna Rao Vijayanagar, and I have worked on Video Compression (AVC, HEVC, MultiView Plus Depth), ABR streaming, and Video Analytics (QoE, Content & Audience, and Ad) for several years.

I hope to use my experience and love for video streaming to bring you information and insights into the OTT universe.