The concept of I-frames, P-frames, and B-frames is fundamental to the field of video compression. These three frame types are used in specific situations to improve the codec’s compression efficiency, the compressed stream’s video quality, and the resilience of the stream to transmission and storage errors and failures.
In this tutorial, we look at how I-frames, P-frames, and B-frames work and what they are used for. If you are into video compression, do read our
- tutorial on the discrete cosine transform,
- why video compression is important,
- and a layman’s explanation of what a video codec is and how it’s created.
Okay, with that, let’s get started with a couple of fundamental aspects of modern day video compression – Intra and Inter prediction.
Inter and Intra Prediction
I won’t do a deep dive into Intra and Inter prediction in this article, but I’ll give you an idea of why these exist and what they are meant for.
Take, for example, the image below. It shows two video frames (adjacent to each other) with a rectangular block of black pixels. In frame 1, the block is on the left-hand side of the image, and in the second frame, it has moved to the right.
If I want to compress Frame #2 using a modern video codec like H.264 or HEVC, I would do something as follows –
- Break the video into blocks of pixels (macroblocks) and compress them one at a time.
- To compress each macroblock, the first step is to find a macroblock similar to the one we want to compress by searching in the current frame or in previously encoded (past or future) frames.
- The best-match macroblock’s location is recorded (which frame and its position in that frame). Then, the two macroblocks’ difference is compressed and sent to the decoder along with the location information.
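The search-and-difference steps above can be sketched in a few lines of Python. This is a toy, exhaustive search over a tiny frame – real encoders use fast, windowed search algorithms – and all function and variable names here are illustrative:

```python
import numpy as np

def sad(a, b):
    # Sum of Absolute Differences -- the usual block-matching cost metric
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def find_best_match(ref_frame, block):
    """Exhaustively search for `block` inside `ref_frame`.
    Returns ((row, col), cost) of the best-matching position."""
    bh, bw = block.shape
    h, w = ref_frame.shape
    best_pos, best_cost = (0, 0), float("inf")
    for r in range(h - bh + 1):
        for c in range(w - bw + 1):
            cost = sad(block, ref_frame[r:r+bh, c:c+bw])
            if cost < best_cost:
                best_pos, best_cost = (r, c), cost
    return best_pos, best_cost

# Toy frames: a 4x4 black block on a white background that moves right
frame1 = np.full((16, 16), 255, dtype=np.uint8)
frame1[6:10, 2:6] = 0                      # block on the left in frame 1
frame2 = np.full((16, 16), 255, dtype=np.uint8)
frame2[6:10, 10:14] = 0                    # block on the right in frame 2

block = frame2[6:10, 10:14]                # the macroblock we want to compress
pos, cost = find_best_match(frame1, block)
motion_vector = (pos[0] - 6, pos[1] - 10)  # displacement from current position
print(motion_vector, cost)                 # -> (0, -8) 0
```

A cost of 0 means the residual (the difference we’d compress and send) is empty: the decoder only needs the motion vector.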
With me so far? Good!
Take a look at the image below. If I want to compress the macroblock in Frame #2 (that I’ve marked with a red square), what do you think is the best option? Or how should it be done?
- First, I can look in frame #1 and find the matching block. It appears to have moved by a distance approximately equal to the frame’s width (a little less, I know) while staying at approximately the same height. Good, we have the motion vector now.
- I can search within the same frame and quickly realize that the block above the one marked in red is IDENTICAL to it. So, I can tell the decoder to copy that one instead of hunting in another frame. The motion vector (if any) is also minimal.
Now take a look at the next example. We want to compress the macroblock containing the blue sphere in frame #2. How should we go about doing this? Search within the same frame or search in previously encoded frames?
- First, I can look in frame #1 and find the matching sphere. It appears to have moved by a distance approximately equal to the frame’s width (a little less, I know) and moved up a little. This gives us the motion vector. The difference between the two blocks containing spheres appears to be very small (a guesstimate!)
- Second, I can search within the same frame and realize no other block contains a sphere. So, bad luck searching for a match within the same frame!
So, what did we learn from these toy examples?
- Encoders search for matching macroblocks to reduce the size of the data that needs to be transmitted. This is done via motion estimation and compensation, which finds the horizontal and vertical displacement (the motion vector) of a macroblock relative to another frame.
- An encoder can search for a matching block within the same frame (Intra Prediction) or in other, previously encoded frames (Inter Prediction).
- It compares the Inter and Intra prediction results for each macroblock and chooses the “best” one. This process is dubbed “Mode Decision,” and in my opinion, it’s the heart of a video codec.
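Mode Decision itself can be sketched as picking the minimum-cost candidate. The mode names and cost numbers below are made up for illustration; a real encoder computes a rate-distortion cost (distortion + λ·bits) for each candidate mode:

```python
def mode_decision(candidates):
    """Pick the cheapest prediction mode for a macroblock.
    `candidates` maps a mode name to its rate-distortion cost."""
    return min(candidates, key=candidates.get)

costs = {
    "intra": 1450,   # predict from neighbouring pixels in the same frame
    "inter": 620,    # predict from a reference frame via a motion vector
    "skip":  900,    # copy the co-located block, no residual sent
}
print(mode_decision(costs))   # -> inter
```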
Again, sorry for the super fast explanation of Intra and Inter Prediction. It’s a vast topic, and I haven’t even scratched the surface. In a future article, I’ll talk about the different ways to perform motion estimation, fast searches, sub-pel motion estimation, early exits, and so many amazing aspects of motion estimation and compensation!
For now, with this whirlwind introduction to Intra and Inter prediction, let’s learn about I, P, and B frames.
What is an I-frame?
An I-frame or a Key-Frame or an Intra-frame consists ONLY of macroblocks that use Intra-prediction. That’s it.
Every macroblock in an I-frame is allowed to refer to other macroblocks only within the same frame. That is, it can only use “spatial redundancies” in the frame for compression. Spatial Redundancy is a term used to refer to similarities between the pixels of a single frame.
An I-frame comes in different avatars in different video codecs as IDR, CRA, or BLA frames, but the essence remains the same for these types of I-frames — no temporal prediction allowed in an I-frame.
An I-frame has many uses which we’ll study after the introduction to P and B-frames.
What is a P-frame?
P-frame stands for Predicted Frame and allows macroblocks to be compressed using temporal prediction in addition to spatial prediction. For motion estimation, P-frames use frames that have been previously encoded. In essence, every macroblock in a P-frame can be,
- temporally predicted, or
- spatially predicted, or
- skipped (i.e., you tell the decoder to copy the co-located block from the previous frame – a “zero” motion vector).
I made an illustration to drive home an important point. You can see both I-frames and P-frames in the image above. The P-frames refer to previously encoded I/P-frames as discussed earlier. You can also see that the order in which the frames are encoded/decoded is the same as how they are presented to the user. This is because P-frames only refer to previously encoded pictures.
What is a B-frame?
A B-frame is a frame that can refer to frames that occur both before and after it. The B stands for Bi-Directional for this reason. If your video codec uses macroblock-based compression (like H.264/AVC does), then each macroblock of a B-frame can be
- predicted using backward prediction (using frames that occur in the future)
- predicted using forward prediction (using frames that occur in the past)
- predicted without inter-prediction – only Intra
- skipped completely (no residual or motion data is sent; the decoder simply copies the predicted block).
And because B-frames have the option to refer to and interpolate from two (or more) frames that occur before and after them (in the time dimension), B-frames can be incredibly efficient in reducing the size of a frame while retaining the video quality. They can exploit both spatial and temporal redundancy (future & past frames), making them very useful in video compression.
However, B-frames are resource-heavy – both at the encoder and decoder. Let’s see why!
To understand the impact of B-frames, let’s understand the concepts of Presentation/Display Order and Decoding Order.
Take the simple case of I and P-frames. If you use only these two picture types, every frame will either refer to itself (I-frame) or to a previous frame (P-frame). So, the frames can come in and out of the encoder in the same order. Here, the Presentation Order (or Display Order) is the same as the Decode/Encode Order.
But, if a frame refers to another frame that is displayed in the future, what do you do? This is the situation that we see when we use B-frames for compression. Take a look at the image below that shows a GOP (group of pictures) structure that uses two B-pictures and one P in each mini-GOP. i.e., IBBPBBP.
Frame #2 in Display Order is a B-frame that depends on Frames 1 & 4 as references. But, to encode Frame #2, we need to wait until Frame #4 enters the encoder and is encoded; only then is it available as a reference for Frame #2.
The same thing occurs at the decoder.
The decoder decodes Frame #1 (I-frame) and then Frame #2 (P-frame) in decoding order. But, it cannot display Frame #2 because it is actually Frame #4 in Display Order! So, the decoder needs to place Frame #2 (in decoding order) into a buffer until it is time to display it.
So, the encoder and decoder need to maintain two “orders” or “queues” in their memory – one to place the frames in the correct display order, and another to place them in the order required to encode and decode them.
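The re-ordering can be sketched with a toy model of a GOP in display order. This is a simplification that assumes each B-frame references only its nearest I/P neighbours (real codecs allow far more flexible reference structures):

```python
def decode_order(gop):
    """Reorder a GOP from display order to decode order.
    A B-frame must wait for its future reference (I or P) to be coded
    first, so every I/P frame is moved ahead of the B-frames that
    precede it in display order."""
    out, pending_bs = [], []
    for i, ftype in enumerate(gop):
        if ftype == "B":
            pending_bs.append(i)      # park B-frames until their future ref arrives
        else:
            out.append(i)             # I/P frame: code it now...
            out.extend(pending_bs)    # ...then the B-frames that depended on it
            pending_bs = []
    out.extend(pending_bs)
    return out

display = ["I", "B", "B", "P", "B", "B", "P"]   # IBBPBBP in display order 0..6
print(decode_order(display))                    # -> [0, 3, 1, 2, 6, 4, 5]
```

Note how the P-frame at display position 3 is coded second, exactly as described above: the decoder must hold it in a buffer until the two B-frames before it have been decoded and displayed.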
Due to these re-ordering requirements, B-frames increase the size of the decoder’s buffer and add latency.
This is why many systems place strict restrictions on the number of frames that can be used as references to compress a single B-frame. Along the same lines, H.264/AVC’s Baseline profile is aimed at low-end devices and does not allow the use of B-frames (B-slices) at all.
Reference B-frame and Non-Reference B-frames
As we’ve learned, a B-frame can refer to two or more frames – typically, one in the future and one in the past relative to its position. We also learned that an I-frame does not refer to any other frame, and a P-frame refers to a picture in the past. The question naturally arises – can any picture use a B-picture as its reference frame?
The answer is Yes.
- A B-frame can act as a reference, and if so, it is termed a reference B-frame.
- If a B-frame is not to be used as a reference, it is called a non-reference B-frame.
It is important to signal whether a frame is a reference or a non-reference B-frame in the bitstream because the decoder needs to store reference frames in its DPB (Decoded Picture Buffer).
If a frame is signaled as a non-reference B-frame, and it is used as a reference, the decoder could crash because, in all likelihood, the decoder would have discarded that frame after decoding and displaying it.
Most encoders will quantize reference B-frames at a finer (higher) quality than non-reference B-frames to minimize the propagation of losses.
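The retention rule described above can be sketched with a toy DPB model. The class and method names are illustrative, not from any real decoder API:

```python
class DecodedPictureBuffer:
    """Toy model of a DPB: reference frames are kept for later
    prediction; non-reference frames can be released as soon as
    they have been displayed."""
    def __init__(self):
        self.refs = {}

    def on_decoded(self, frame_id, is_reference):
        if is_reference:
            self.refs[frame_id] = f"pixels of {frame_id}"  # keep for prediction

    def get_reference(self, frame_id):
        if frame_id not in self.refs:
            # The failure mode described above: a frame marked as
            # non-reference was discarded, so a later frame cannot use it.
            raise LookupError(f"{frame_id} was not retained as a reference")
        return self.refs[frame_id]

dpb = DecodedPictureBuffer()
dpb.on_decoded("I0", is_reference=True)
dpb.on_decoded("B1", is_reference=False)   # non-reference B: not retained
dpb.get_reference("I0")                    # fine
# dpb.get_reference("B1")                  # would raise LookupError
```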
Use of I, P, and B-frames in Video Compression & Streaming
With a technical understanding of how I-frames, P-frames, and B-frames work, now let’s tackle an important question. Why should you use them?
In the next few sections, let’s understand the most important use cases of I, P, and B-frames in video compression.
Where do you use I-frames?
We learned in the earlier sections that I-frames can be independently encoded and decoded and this drives their usage in video compression.
Refreshing Video Quality
I-frames are generally inserted to designate the start of a GOP (Group of Pictures) or a video segment (refer to our article on ABR streaming fundamentals). Because I-frame compression is not dependent on previously encoded pictures, it can “refresh” the video quality. Encoders are typically tuned to favor I-frames in terms of size and quality because they play a critical role in maintaining video quality. After encoding an I-frame with high video quality, the encoder can then use it as a reference picture to compress P and B-frames.
Are I-frames used only for refreshing the video quality? Nope!
Recovery from Bitstream Errors
Remember, we said that I-frames could be independently encoded and decoded? This implies that I-frames can be used to recover from catastrophic failures in the video file or video streaming.
Let’s see how.
If a P-frame or a reference B-frame is corrupted, then all the frames that depend on it cannot be decoded satisfactorily, and this leads to glitches in the video. The video usually cannot recover from such problems. However, when a corrupted video stream reaches an I-frame, the decoder can decode it independently and recover from the problem. This is a clean restart of the decoding process, and IF all the frames from that point onwards refer only to frames after the I-frame, then the video can recover.
Such I-frames are typically referred to as an Instantaneous Decoder Refresh or an IDR frame. And, the act of not referring to pictures before the I-frame is called a closed GOP. (Note: we’ll talk about GOPs, mini-GOPs, closed GOPs, and open GOPs in another article on OTTVerse.com).
IDR frames are used commonly in ABR streaming to denote a new segment of the video. By starting each segment with an IDR, the platform can ensure that every segment can be decoded independently of other video segments. This property ensures that the video playback can continue even if a few segments are corrupted or lost due to transmission problems.
Trick Modes (Seeking Forward and Back)
Finally, Key-frames are vital for trick modes!
If you want to seek forward or back in a video, you need to have an I-frame at the point of restarting the video. Right?
Think about it: if you seek to a P or a B-frame and the decoder has already dumped its reference frames from memory, how are you going to reconstruct it? The video player will naturally seek a starting point (an I-frame) from which it can decode successfully and start playing back.
This brings us to another interesting point.
If you place Key Frames far apart in the video – say every 20 seconds, then your users can seek in increments of 20 seconds only. That’s a terrible user experience!
However, if you place too many key-frames, the seeking experience will be great, but the video’s size will be too big and might result in buffering!
Designing the optimal GOP and mini-GOP structure is truly a balancing art 🙂
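Snapping a seek request to the closest preceding key-frame is, at its core, a bisect over the key-frame timestamps. The timestamps below are illustrative:

```python
import bisect

def seek_target(keyframe_times, requested_time):
    """Snap a seek request to the closest preceding key-frame.
    Playback must (re)start from an I-frame, so we bisect the
    sorted list of key-frame timestamps (in seconds)."""
    i = bisect.bisect_right(keyframe_times, requested_time) - 1
    return keyframe_times[max(i, 0)]

keyframes = [0, 20, 40, 60]        # a key-frame every 20 seconds
print(seek_target(keyframes, 33))  # -> 20: the user lands 13 seconds early
```

With a 20-second key-frame interval, a seek to 33 s lands at 20 s – exactly the poor experience described above; a smaller interval shrinks that gap at the cost of a bigger file.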
Where do you use P and B frames?
This is a very common question that people ask: Where, when, and how do I use P-frames and B-frames?
If you understood how P-frames and B-frames work from the previous sections, you’d recognize that P & B-frames reduce the video’s size while retaining the video quality. That is their main use! P and B-frames are inserted at appropriate places to reduce the video’s file size or bitrate and are tuned to maintain a certain video quality level.
Based on the GOP and mini-GOP structure you choose in the encoder, P-frames and B-frames (both reference and non-reference) are inserted and compressed with appropriate QP values to achieve the target bitrate or quality.
I hope this article on I-frames, P-frames, and B-frames helped increase your knowledge about video compression. To improve your understanding, download a static FFmpeg build and play around with the GOP-size and B-frame settings in FFmpeg to see how the size of the video and its quality vary.
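As a starting point, here is a sketch (in Python, building the command rather than running it) that uses FFmpeg’s `-g` option (GOP/key-frame interval in frames) and `-bf` option (maximum consecutive B-frames; 0 disables them entirely). The file names are placeholders:

```python
import shlex

def ffmpeg_cmd(src, dst, gop_size, max_b_frames):
    """Build an FFmpeg command line for experimenting with GOP settings.
    -g sets the key-frame (GOP) interval in frames; -bf sets the maximum
    number of consecutive B-frames (0 disables B-frames)."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-g", str(gop_size),
        "-bf", str(max_b_frames),
        dst,
    ]

cmd = ffmpeg_cmd("input.mp4", "out_no_b.mp4", gop_size=48, max_b_frames=0)
print(shlex.join(cmd))
```

Compare the output file sizes with, say, `max_b_frames=0` versus `max_b_frames=3`, and with small versus large `gop_size` values, to see the trade-offs discussed in this article.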
Good luck, and do let me know what you think of the article in the comments. Please share it with your friends on LinkedIn, Twitter, Facebook, HackerNews, and Reddit if you liked it.
Thank you and see you next time for another technical tutorial on OTTVerse.com.
Krishna Rao Vijayanagar
I’m Dr. Krishna Rao Vijayanagar, founder of OTTVerse. I have a Ph.D. in Video Compression from the Illinois Institute of Technology, and I have worked on Video Compression (AVC, HEVC, MultiView Plus Depth), ABR streaming, and Video Analytics (QoE, Content & Audience, and Ad) for several years.
I hope to use my experience and love for video streaming to bring you information and insights into the OTT universe.
I would like to understand: can the GOP size set during encoding help decide the segmentation duration at packaging? Or, how would someone know where to start a segment without knowing the GOP size?
Thanks, Prashant, for your comment! Typically, you need to know the GOP size during encoding because it sets the location of an IDR (closed GOP). The same value can be passed on to the packager, which can use the GOP size as a guide and cut the video at that location.
If you do not know the GOP size, you can still parse the bitstream using the spec and identify the type of each frame, but the former is easier.