I, P, and B-frames - Differences and Use Cases Made Easy

I-frames, P-frames, and B-frames are fundamental to video compression. These three frame types are used in specific situations to improve the codec’s compression efficiency, the compressed stream’s video quality, and the resilience of the stream to transmission and storage errors & failures.

In this tutorial, we look at how I-frames, P-frames, and B-frames work and what they are used for. Let’s start with some fundamental aspects of modern-day video compression – Intra and Inter prediction.

Inter and Intra Prediction

I won’t do a deep dive into Intra and Inter prediction in this article, but I’ll give you an idea of why they exist and what they are meant for.

Take, for example, the image below. It shows two video frames (adjacent to each other) with a rectangular block of black pixels. In frame 1, the block is on the left-hand side of the image, and in the second frame, it has moved to the right.

If I want to compress Frame #2 using a modern video codec like H.264 or HEVC, I would do something as follows –

Break the frame into blocks of pixels (macroblocks) and compress them one at a time.
In order to compress each macroblock, the first step is to find a macroblock similar to the one we want to compress by searching in the current frame or previous or future frames.
The best-match macroblock’s location is recorded (which frame and its position in that frame). Then, the two macroblocks’ difference is compressed and sent to the decoder along with the location information.
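The search step above can be sketched as a brute-force block match. This is a minimal illustration, not how H.264 or HEVC encoders actually search: the frames are tiny 2-D lists of luma values, the cost is a plain sum of absolute differences (SAD), and the helper names (`sad`, `best_match`, `make_frame`) are made up for this example.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def best_match(cur, ref, x, y, size, search_range):
    """Exhaustively search `ref` around (x, y) for the block most similar to
    the block at (x, y) in `cur`. Returns (motion_vector, cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, x, y, ref, x, y, size)
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = x + mx, y + my
            # Only consider candidate blocks that lie fully inside the frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, x, y, ref, rx, ry, size)
                if cost < best_cost:
                    best_mv, best_cost = (mx, my), cost
    return best_mv, best_cost


def make_frame(width, height, block_x, block_y, size):
    """Grey frame (value 128) with a black square (value 0) at (block_x, block_y)."""
    frame = [[128] * width for _ in range(height)]
    for dy in range(size):
        for dx in range(size):
            frame[block_y + dy][block_x + dx] = 0
    return frame


frame1 = make_frame(16, 8, 1, 2, 4)   # black square on the left
frame2 = make_frame(16, 8, 9, 2, 4)   # same square, moved 8 pixels to the right
mv, cost = best_match(frame2, frame1, 9, 2, 4, search_range=8)
print(mv, cost)  # prints (-8, 0) 0
```

The motion vector (-8, 0) says "look 8 pixels to the left in the reference frame", and a cost of 0 means there is no difference left to compress. Real encoders use much faster search patterns than this exhaustive loop.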

With me so far? Good!

Take a look at the image below. If I want to compress the macroblock in Frame #2 (that I’ve marked with a red square), what do you think is the best option? Or how should it be done?

I-frames, P-frames, and B-frames and Intra Prediction

First, I can find the matching block in frame #1. It appears to have moved horizontally by roughly the frame’s width while staying at about the same height. This movement gives us the motion vector.
Second, I can search within the same frame and quickly realize that the block above the one marked in red is IDENTICAL to it. So, I can tell the decoder to copy that one instead of hunting in another frame. The motion vector (if any) is also minimal.

Now take a look at the next example. We want to compress the macroblock containing the blue sphere in frame #2. How should we go about doing this? Search within the same frame or search in previously encoded frames?

First, I can look in frame #1 and find the matching sphere. It appears to have moved by a distance approximately the frame’s width (a little less, I know) and moved up a little. This gives us the motion vector. The difference between the two blocks containing spheres appears to be very small (guesstimate!)
Second, I can search within the same frame and realize no other block contains a sphere. So, bad luck searching for a match within the same frame!

So, what did we learn from these toy examples?

Encoders search for matching macroblocks to reduce the size of the data that needs to be transmitted. This is done via a process of motion estimation and compensation. This allows the encoder to find the horizontal and vertical displacement of a macroblock in another frame.
An encoder can search for a matching block within the same frame (Intra Prediction) or in other frames (Inter Prediction).
It compares the Inter and Intra prediction results for each macroblock and chooses the “best” one. This process is dubbed “Mode Decision,” and in my opinion, it’s the heart of a video codec.
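To make the idea of Mode Decision concrete, here is a small sketch that picks the cheaper of the Intra and Inter candidates for the two toy examples above. It assumes a rate-distortion style cost (distortion plus a weight times the bits spent); the weight `LAMBDA`, the function names, and all the distortion and bit numbers are invented for illustration.

```python
LAMBDA = 4.0  # hypothetical weight trading distortion against bits


def rd_cost(distortion, bits, lam=LAMBDA):
    """Combined cost: lower is better."""
    return distortion + lam * bits


def choose_mode(candidates, lam=LAMBDA):
    """candidates maps a mode name to (distortion, bits).
    Returns (best_mode, best_cost)."""
    best = min(candidates, key=lambda mode: rd_cost(*candidates[mode], lam))
    return best, rd_cost(*candidates[best], lam)


# Black block: an identical block sits right above it in the same frame,
# so Intra is cheap; Inter needs a long motion vector (more bits).
black_block = {"intra": (0, 6), "inter": (0, 14)}

# Blue sphere: nothing similar in the same frame, but a near-identical
# block exists in the previous frame.
blue_sphere = {"intra": (900, 40), "inter": (30, 18)}

print(choose_mode(black_block))  # prints ('intra', 24.0)
print(choose_mode(blue_sphere))  # prints ('inter', 102.0)
```

The point is only that the encoder evaluates every candidate with the same yardstick and keeps the winner per macroblock.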

Again, sorry for the super fast explanation of Intra and Inter Prediction. It’s a vast topic, and I haven’t even scratched the surface. In a future article, I’ll talk about the different ways to perform motion estimation, fast searches, sub-pel motion estimation, early exits, and so many amazing aspects of motion estimation and compensation!

For now, with this whirlwind introduction to Intra and Inter prediction, let’s learn about I, P, and B frames.

What is an I-frame?

An I-frame or a Key-Frame or an Intra-frame consists ONLY of macroblocks that use Intra-prediction. That’s it.

Every macroblock in an I-frame can refer to other macroblocks only within the same frame. It can only use “spatial redundancies” in the frame for compression. Spatial Redundancy is a term used to refer to similarities between the pixels of a single frame.

An I-frame comes in different avatars in different video codecs, such as IDR, CRA, or BLA frames, but the essence remains the same for all of them: no temporal prediction is allowed in an I-frame.

An I-frame has many uses, which we’ll study after introducing P and B-frames.

What is a P-frame?

P-frame stands for Predicted Frame and allows macroblocks to be compressed using temporal prediction in addition to spatial prediction. For motion estimation, P-frames use frames that have been previously encoded. In essence, every macroblock in a P-frame can be,

temporally predicted, or
spatially predicted, or
skipped (i.e., you tell the decoder to copy the co-located block from the previous frame – a “zero” motion vector).

I made an illustration to drive home an important point. You can see both I-frames and P-frames in the image above. The P-frames refer to previously encoded I/P-frames as discussed earlier. You can also see that the order in which the frames are encoded/decoded is the same as how they are presented to the user. This is because P-frames only refer to previously encoded pictures.

What is a B-frame?

A B-frame is a frame that can refer to frames that occur both before and after it. The B stands for Bi-Directional for this reason. If your video codec uses macroblock-based compression (like H.264/AVC does), then each macroblock of a B-frame can be

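predicted using backward prediction (using frames that occur in the future)
predicted using forward prediction (using frames that occur in the past)
predicted using bi-directional prediction (combining a past frame and a future frame)
predicted without inter-prediction – only Intra
skipped completely (nothing is coded for the block, and the decoder reconstructs it from its prediction alone).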

And because B-frames can refer to and interpolate from two (or more) frames that occur before and after it (in the time dimension), B-frames can be incredibly efficient in reducing the size of the frame while retaining the video quality. They can exploit both spatial and temporal redundancy (future & past frames), making them very useful in video compression.
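A quick way to see why interpolating between two references helps is a fade. In the sketch below, the current block sits halfway between a darker past block and a brighter future block. The blocks are flat lists of made-up pixel values, the bi-directional prediction is a plain rounded average (real codecs also allow weighted prediction), and the function names are hypothetical.

```python
def residual_energy(block, prediction):
    """Sum of squared differences left over after prediction."""
    return sum((b - p) ** 2 for b, p in zip(block, prediction))


def bi_predict(past, future):
    """Rounded average of a past block and a future block."""
    return [(p + f + 1) // 2 for p, f in zip(past, future)]


past = [100, 100, 100, 100]      # block from an earlier frame
future = [140, 140, 140, 140]    # block from a later frame
current = [120, 120, 120, 120]   # block being coded, midway through the fade

costs = {
    "forward (past only)": residual_energy(current, past),
    "backward (future only)": residual_energy(current, future),
    "bi-directional": residual_energy(current, bi_predict(past, future)),
}
print(costs)
# prints {'forward (past only)': 1600, 'backward (future only)': 1600, 'bi-directional': 0}
```

Either single reference leaves a sizeable residual, while the average of the two predicts the block exactly, so almost nothing needs to be sent.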

However, B-frames are resource-heavy – both at the encoder and decoder. Let’s see why!

To understand the impact of B-frames, let’s understand the concepts of Presentation/Display Order and Decoding Order.

Take the simple case of I and P frames. If you use only these two picture types, every frame will refer to itself (I-frame) or a previous frame (P-frame). So, the frames can come in and out of the encoder in the same order. Here, the Presentation Order (or Display Order) is the same as the Decode/Encode order.

I-frames and P-frames Decoding and Display or Presentation Order

But what do you do if a frame refers to another frame that is displayed in the future? This is the situation that we see when we use B-frames for compression. Look at the image below that shows a GOP (group of pictures) structure that uses two B-pictures and one P in each mini-GOP, i.e., IBBPBBP.

Frame #2 in Display Order is a B-frame that depends on Frames #1 and #4 as references. But, to encode Frame #2, we need to wait until Frame #4 enters the encoder and is encoded; only then is it available as a reference for Frame #2.

The same thing occurs at the decoder.

In decoding order, the decoder decodes Frame #1 (I-frame) and then Frame #2 (P-frame). But the player cannot display Frame #2 because it is Frame #4 in Display Order! So, the decoder needs to place Frame #2 (in decoding order) into a buffer until it is time to display it.

So, the encoder and decoder need to maintain two “orders” or “queues” in their memory – one to place the frames in the correct display order and another to place the frames in the order required to encode & decode them.
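The re-ordering can be sketched in a few lines. The function below takes the frame types in display order and returns the display indices in one valid decode order. It assumes the simple case described above, where each B-frame refers only to the nearest I/P frame on either side; `decode_order` is a name made up for this example.

```python
def decode_order(display_types):
    """Map frame types in display order (e.g. "IBBPBBP") to the 1-based
    display indices listed in decode order."""
    order, pending_b = [], []
    for idx, frame_type in enumerate(display_types, start=1):
        if frame_type == "B":
            # A B-frame must wait until its future reference is decoded.
            pending_b.append(idx)
        else:
            # An I or P frame goes first, then the B-frames that waited on it.
            order.append(idx)
            order.extend(pending_b)
            pending_b = []
    # Trailing B-frames would need a later I/P frame in a real stream;
    # this toy version simply appends them.
    order.extend(pending_b)
    return order


print(decode_order("IBBPBBP"))  # prints [1, 4, 2, 3, 7, 5, 6]
print(decode_order("IPPP"))     # prints [1, 2, 3, 4]
```

In the IBBPBBP case, the second frame to be decoded is display Frame #4, which is exactly why the decoder has to hold it in a buffer until Frames #2 and #3 have been shown.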

Due to the re-ordering requirements, B-frames impact the size of the decoder’s buffer and increase latency.

This is why many systems restrict the number of frames that can be used as references to compress a single B-frame. Along the same lines, H.264/AVC’s Baseline profile is aimed at low-end devices and does not allow using B-frames or B-slices.

Reference B-frame and Non-Reference B-frames

As we’ve learned, a B-frame can refer to two or more frames – typically, one in the future and one in the past relative to its position. We also learned that an I-frame does not refer to any other frame and a P-frame refers to a picture in the past. The question arises: Can any picture use a B-picture as its reference frame?

The answer is Yes.

A B-frame can act as a reference, and if so, it is termed as a reference B-frame.
If a B-frame is not to be used as a reference, it is called a non-reference B-frame.

It is important to signal whether a frame is a reference or a non-reference B-frame in the bitstream because the decoder needs to store reference frames in its DPB (Decoded Picture Buffer).

If a frame is signaled as a non-reference B-frame, and it is used as a reference, the decoder could crash because, in all likelihood, the decoder would have discarded that frame after decoding and displaying it.
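Here is a toy model of why that signaling matters. It keeps only frames flagged as references in a stand-in for the DPB and fails when a frame asks for a reference that was never stored. It is a sketch under simplifying assumptions: the toy DPB never evicts anything, and the dictionary keys (`id`, `is_reference`, `refs`) and the `MissingReference` exception are invented for this example.

```python
class MissingReference(Exception):
    """Raised when a frame predicts from a picture that is not in the DPB."""


def decode_stream(frames):
    """frames: dicts in decode order with keys 'id', 'is_reference' and
    'refs' (ids this frame predicts from). Returns the ids left in the DPB."""
    dpb = set()
    for frame in frames:
        missing = [ref for ref in frame["refs"] if ref not in dpb]
        if missing:
            raise MissingReference(f"frame {frame['id']} needs {missing}")
        if frame["is_reference"]:
            dpb.add(frame["id"])  # only reference pictures are kept
    return sorted(dpb)


good_stream = [
    {"id": 1, "is_reference": True, "refs": []},        # I-frame
    {"id": 4, "is_reference": True, "refs": [1]},       # P-frame
    {"id": 2, "is_reference": True, "refs": [1, 4]},    # reference B-frame
    {"id": 3, "is_reference": False, "refs": [2, 4]},   # non-reference B-frame
]
print(decode_stream(good_stream))  # prints [1, 2, 4]

# Same stream, but frame 2 is wrongly signaled as non-reference.
bad_stream = [dict(frame) for frame in good_stream]
bad_stream[2]["is_reference"] = False
try:
    decode_stream(bad_stream)
except MissingReference as err:
    print("decode failed:", err)  # prints decode failed: frame 3 needs [2]
```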

Most encoders will quantize reference B-frames at a better quality than non-reference B-frames to minimize propagation losses.

Use of I, P, and B-frames in Video Compression & Streaming

With a technical understanding of how I-frames, P-frames, and B-frames work, now let’s tackle an important question. Why should you use them?

In the next few sections, let’s understand the most important use cases of I, P, and B-frames in video compression.

So, where do you use I-frames? We learned in the earlier sections that I-frames can be independently encoded and decoded, which drives their usage in video compression.

Using I-frames for Refreshing Video Quality

I-frames are generally inserted to designate the start of a GOP (Group of Pictures) or a video segment (refer to our article on ABR streaming fundamentals). Because I-frame compression is not dependent on previously-encoded pictures, it can “refresh” the video quality. Encoders are typically tuned to favor I-frames in size and quality because they play a critical role in maintaining video quality. After encoding an I-frame with high video quality, the encoder can use it as a reference picture to compress P and B-frames.

Are I-frames used only for refreshing the video quality? Nope!

Recovery from Bitstream Errors

Remember, we said that I-frames could be independently encoded and decoded? This implies that I-frames can be used to recover from catastrophic failures in the video file or video streaming.

Let’s see how.

If a P-frame or a reference B-frame is corrupted, then all the frames that depend on it cannot be decoded satisfactorily, and this leads to glitches in the video. The video usually cannot recover from such problems on its own. However, when the decoder reaches an I-frame in the corrupted stream, it can decode that frame independently. This is a clean restart of the decoding process, and IF all the frames from that point onwards refer only to the I-frame or to frames after it, then the video can recover.

Such an I-frame is typically called an Instantaneous Decoder Refresh or IDR frame. A GOP whose frames do not refer to pictures before its I-frame is called a Closed GOP. (Note: we’ll talk about GOPs, mini-GOPs, closed GOPs, and open GOPs in another article on OTTVerse.com).
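The “IF” in the recovery argument can be checked with a small dependency walk. The sketch below marks a frame as damaged when it is corrupted or when it predicts from a damaged frame. It assumes frames are listed in decode order and only refer to earlier entries; `damaged_frames` and the two example streams are made up for illustration.

```python
def damaged_frames(refs_per_frame, corrupted):
    """refs_per_frame[i] lists the decode-order indices frame i predicts from
    (empty for an I-frame). Returns the sorted indices that cannot be decoded
    cleanly when frame `corrupted` is damaged."""
    damaged = {corrupted}
    for index, refs in enumerate(refs_per_frame):
        if any(ref in damaged for ref in refs):
            damaged.add(index)
    return sorted(damaged)


# Two GOPs of I P P P. Frame 4 is an I-frame, and nothing after it
# reaches back past it (closed GOP).
closed_gop = [[], [0], [1], [2], [], [4], [5], [6]]
print(damaged_frames(closed_gop, corrupted=1))  # prints [1, 2, 3]

# Same layout, but frame 5 also refers to frame 3, which precedes the I-frame.
leaky_gop = [[], [0], [1], [2], [], [3, 4], [5], [6]]
print(damaged_frames(leaky_gop, corrupted=1))   # prints [1, 2, 3, 5, 6, 7]
```

In the first stream the damage stops at the I-frame. In the second, a single reference that reaches back across the I-frame carries the error into the next GOP.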

IDR frames to start segments in ABR Streaming

IDR frames are commonly used in ABR streaming to denote a new video segment. By starting each segment with an IDR, the platform can ensure that every segment can be decoded independently of other video segments. This property ensures that the video playback can continue even if a few segments are corrupted or lost due to transmission problems.

Trick Modes (Seeking Forward and Back)

Finally, Key-frames are vital for trick modes!

If you want to seek forward or back in a video, you need to have an I-frame when restarting the video. Right?

Think about it: if you seek to a P or a B-frame and the decoder has already dumped its reference frames from memory, how will you reconstruct them? The video player will naturally seek a starting point (an I-frame) to decode successfully and start playing back from that point onwards.

This brings us to another interesting point.

If you place Key Frames far apart in the video – say every 20 seconds, then your users can seek in increments of 20 seconds only. That’s a terrible user experience!
However, if you place too many key-frames, the seeking experience will be great, but the video’s size will be too big and might result in buffering!
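The trade-off is easy to see with a little arithmetic. The sketch below finds the key-frame a player would have to start decoding from for a given seek target, using a sorted list of key-frame timestamps in seconds. The function name and the timestamps are hypothetical, and real players may handle the gap differently (for example, by decoding forward silently to the exact target).

```python
from bisect import bisect_right


def seek_start(keyframe_times, target):
    """Return the timestamp of the last key-frame at or before `target`.
    `keyframe_times` must be sorted in ascending order."""
    position = bisect_right(keyframe_times, target)
    if position == 0:
        raise ValueError("no key-frame at or before the target")
    return keyframe_times[position - 1]


sparse = list(range(0, 61, 20))  # a key-frame every 20 seconds
dense = list(range(0, 61, 2))    # a key-frame every 2 seconds

target = 37
print(seek_start(sparse, target))  # prints 20
print(seek_start(dense, target))   # prints 36
```

With key-frames 20 seconds apart, a seek to 0:37 has to start from 0:20. With key-frames every 2 seconds it starts from 0:36, at the price of many more large I-frames in the stream.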

Designing the optimal GOP and mini-GOP structure is truly an art 🙂

Where do you use P and B frames?

People ask this very common question: Where, when, and how do I use P-frames and B-frames?

If you understood how P-frames and B-frames work from the previous sections, you’d recognize that P & B-frames reduce the video’s size while retaining the video quality. That is their main use! P and B-frames are inserted at appropriate places to reduce the video’s file size or bitrate and are tuned to maintain a certain quality level.

Based on the GOP and mini-GOP structure you decide to use in the encoder, P-frames and B-frames (reference and non-reference) are inserted and compressed with relevant QP values to achieve the target bitrate or quality.

Conclusion

I hope this article on I-frames, P-frames, and B-frames helped increase your knowledge about video compression. To improve your understanding, download a static FFmpeg build, and play around with the GOP and B-frame settings in FFmpeg to see how the size of the video and its quality vary.
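If you want a starting point for that experiment, here is a small sketch that builds the commands rather than running them. It assumes an FFmpeg build with the libx264 encoder and uses FFmpeg’s `-g` (key-frame interval) and `-bf` (maximum consecutive B-frames) options, plus ffprobe’s `pict_type` frame field. The file names and function names are placeholders.

```python
from collections import Counter


def ffmpeg_gop_command(src, dst, gop_size, max_b_frames):
    """Build an FFmpeg command that re-encodes `src` with libx264 using the
    given key-frame interval (-g) and B-frame limit (-bf); audio is dropped."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-g", str(gop_size),
        "-bf", str(max_b_frames),
        "-an",
        dst,
    ]


def ffprobe_pict_type_command(path):
    """Build an ffprobe command that prints one picture type per video frame."""
    return [
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "frame=pict_type", "-of", "csv=p=0", path,
    ]


def count_frame_types(ffprobe_output):
    """Count I, P, and B entries in the text printed by the ffprobe command."""
    types = (line.split(",")[0].strip() for line in ffprobe_output.splitlines())
    return Counter(t for t in types if t)


# Hypothetical file names; try a few combinations and compare file sizes.
print(ffmpeg_gop_command("input.mp4", "gop50_nob.mp4", gop_size=50, max_b_frames=0))
print(ffmpeg_gop_command("input.mp4", "gop50_b2.mp4", gop_size=50, max_b_frames=2))

# Example of the kind of text the ffprobe command prints, one type per line.
sample_output = "I\nB\nB\nP\nB\nB\nP\n"
print(count_frame_types(sample_output))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)`, and feed the captured ffprobe output to `count_frame_types` to confirm which frame types ended up in each file.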

Good luck, and do let me know what you think of the article in the comments. Please share it with your friends on LinkedIn, Twitter, Facebook, HackerNews, and Reddit if you like it.

Thank you, and see you next time for another technical tutorial on OTTVerse.com.

Krishna Rao Vijayanagar

Founder at OTTVerse

Krishna Rao Vijayanagar, Ph.D., is the Editor-in-Chief of OTTVerse, a news portal covering tech and business news in the OTT industry.

With extensive experience in video encoding, streaming, analytics, monetization, end-to-end streaming, and more, Krishna has held multiple leadership roles in R&D, Engineering, and Product at companies such as Harmonic Inc., MediaMelon, and Airtel Digital. Krishna has published numerous articles and research papers and speaks at industry events to share his insights and perspectives on the fundamentals and the future of OTT streaming.

Hasan Sharifi

September 12, 2021 at 4:02 pm

That was awesome. Thank you very much.

Krishna
September 13, 2021 at 8:51 am

You’re welcome! Please share the article with your friends if you found it useful.

WenhanQiao

September 16, 2021 at 9:09 am

Hello, this is awesome, can I forward to my blog? Thank you!

Kuldeep Verma

September 29, 2021 at 1:23 pm

Nice explanation. Thanks…

Rana

October 5, 2021 at 7:24 pm

This is awesome,
please, can you send a reference list, cu level, and GOP in vvc standards
thank you

Siavash Ardekani

October 11, 2021 at 7:18 pm

Very short, but very useful. Now I understood the principle. Thank you.

dashne

June 19, 2022 at 6:41 pm

Great explanation. Thank you

Krishna
June 22, 2022 at 2:09 pm

Thank you so much!

ozair ahmad

August 17, 2022 at 11:40 am

great sir

Gautam D Goradia

August 27, 2022 at 10:18 am

Very fine article Dr. Krishna Rao Vijayanagar. I’d like to connect to demonstrate how we are reducing data size further without too much loss in quality, at least for the purpose of auditing the footage, and also to create a very cost effective disaster recovery mechanism.

Prashant Kulkarni

September 15, 2022 at 11:32 pm

As always another very well written article.

I would like to understand :
Can GOP size set during encoding help decide segmentation duration at the packaging? OR How someone will know where to start a segment without knowing GOP size?

Krishna
October 2, 2022 at 6:03 pm

Thanks, Prashant for your comment! Typically, you need to know the GOP size during encoding because it will set the location of an IDR (Closed GOP). This same value can be passed on to the packager which can use the GOP size as a guide and cut the video at that location.

If you do not know the GOP size, you can still parse the bitstream using the spec and identify the type of frame. But, the former is easier.

Praveen R

October 11, 2022 at 11:58 am

It was very informative and interesting. Thank You

Krishna
November 20, 2022 at 9:14 am

Thank you Praveen – much appreciated!

Roshan

May 3, 2023 at 7:05 pm

Hi Krishna, is it possible to recreate the original video after you have extracted the I P B frames? If I am understanding correctly, the P and B frames are much smaller than the key I frame and broken into macroblocks. But I’m not sure how to reconstitute the original video from the extracted I P B frames.

vkr2020
May 4, 2023 at 10:27 am

Every frame is broken into macroblocks in codecs like H.264, HEVC, AV1, etc. Think of I, P, B as three ways to treat a video frame – based on the quality you need, and the use case.

The codec specification tells you what tools to use to compress a video frame and how to reverse the process at the decoder. When we say that the P & B frames are smaller, it means that they are compressed more than the I-frame. But, as long as you use the process dictated by the codec specification, the process is completely reversible at the decoder.

Also, keep in mind that modern codecs are lossy codecs. When you classify a frame (I, P, or B) and compress it, you are throwing away a certain amount of information and that is lost forever.

If all of this is confusing, let me know and I’ll try and explain better.

Anthony R. King

August 26, 2023 at 9:45 pm

Contemplating creating a front-end to ffmpeg aimed (initially) at non-destructive video-editing, I recall from when doing this with AVIs with VirtualDub, I could only trim non-destructively at the start at I-frames. VirtualDub even had jump to next/previous I-frame controls, as I recall.

My question then is when we potentially now have B-frames in some H264 varieties for example, would it be the case that we could achieve per-frame non-destructive trimming at the start of a video? In which case, how can I establish how close to a desired cut-point I can use? i.e. How can I even establish whether a video contains B-frames?

vkr2020
August 31, 2023 at 4:09 pm

You can use ffprobe to check if your video has B-frames. Check out the “per-frame” example here (https://ottverse.com/ffprobe-comprehensive-tutorial-with-examples/) and look for the pict_type field, which will tell you if you are dealing with an I, P, or B frame. Non-destructive trimming is hard because B-frames refer to other frames (for motion estimation and compensation). If you only had I and P frames in your video, it’s much easier, because P-frames only refer to frames that occur before them (in decoding order). So, if you cut at the 30th frame, then the 29th frame is sufficient to decode the 30th frame (if it was a P-frame). However, if the 30th frame was a B-frame, there is a likelihood that it depends on the 29th and maybe the 31st frame too.

Hope this helps!
