EasyVMAF: Running VMAF In The Wild

VMAF is one of the most popular tools for video quality assessment, and it is well on its way to becoming a standard reference metric for the video industry. However, running VMAF can get tricky in some situations and lead to wrong results. In this article, guest author Gabriel Dávila Revelo introduces easyVMAF, a tool he has developed to make VMAF computations easier and more straightforward.

VMAF is a full reference metric that compares the reference (or source) and distorted video sequences to predict the subjective video quality.

The advantage of VMAF is that it seeks to mimic the viewer's perception (the human visual system) rather than relying only on pixel-level fidelity measures such as PSNR or SSIM. For an in-depth explanation of VMAF, please read the official Netflix blog.

Several third-party tools have been built on top of VMAF with the participation of the open-source community. As a result, VMAF is available through several open-source packages: the VMAF Python library, vmafossexec (a C executable), a VMAF Docker image, libvmaf (a C library), and FFmpeg compiled with libvmaf.

Note: Installation procedures and usage notes for FFmpeg and VMAF are available on OTTVerse.com.

In spite of the several tools available to compute VMAF, it is often challenging to meet the strict set of requirements that VMAF imposes. Some of these requirements are:

The reference and distorted videos need to be frame-synchronized: the scan mode (interlaced/progressive), the duration, and the frame rate all have to match.
The resolution of the reference and distorted videos has to match, which often requires a robust up/down-scaling procedure.

Hence in practice, the videos have to be normalized/equalized if the reference and distorted videos do not meet the above-mentioned requirements.

In this tutorial, we will go through a series of recommendations using FFmpeg-based examples to show you how to normalize your streams for using VMAF.

Finally, we introduce easyVMAF, an open-source tool to automate your VMAF computations.

In the next few sections, we cover the basics of (i) scaling videos and (ii) frame synchronization, and then (iii) introduce easyVMAF.

To keep the article’s length reasonable, we show only basic examples. For complete use cases, please head over to https://github.com/gdavila/easyVMAF.

Without further ado, let’s get started!

Scaling Video Resolution to The Right VMAF Model

The available VMAF implementations support three models: HD, 4K, and Phone.

Each model was trained by the team at Netflix, considering different scenarios such as screen size, resolution, and viewer’s distance from the display device.

Accordingly, the VMAF specifications require that the video resolution match the resolution expected by each model (refer to the first FAQ here):

The HD and Phone models require 1920x1080 video as their inputs.
The 4K model requires 3840x2160 as its input.

A mismatch between the resolution of the reference and distorted videos can be resolved with bicubic interpolation in FFmpeg's scale filter.

For example, to use the HD model (vmaf_v0.6.1.pkl), we need to scale the distorted video (if it isn't 1920x1080) with the following FFmpeg command:

ffmpeg -i <distorted> -i <reference> -lavfi "[0:v]scale=1920:1080:flags=bicubic[distorted];[distorted][1:v]libvmaf=model_path=/usr/local/share/model/vmaf_v0.6.1.pkl" -f null -

Similarly, to use the 4K model, we need to scale the distorted video to 3840x2160:

ffmpeg -i <distorted> -i <reference> -lavfi "[0:v]scale=3840:2160:flags=bicubic[distorted];[distorted][1:v]libvmaf=model_path=/usr/local/share/model/vmaf_4k_v0.6.1.pkl" -f null -

The above examples assume that the reference video already matches the resolution expected by the VMAF model.

If that is not the case, you can apply the same scale filter to the reference as well:

ffmpeg -i <distorted> -i <reference> -lavfi "[0:v]scale=3840:2160:flags=bicubic[distorted];[1:v]scale=3840:2160:flags=bicubic[reference];[distorted][reference]libvmaf=model_path=/usr/local/share/model/vmaf_4k_v0.6.1.pkl" -f null -
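The scaling rule above can also be expressed as a small Python helper that builds the -lavfi string. This is only a sketch: the function name, the MODEL_RESOLUTION table, and the (width, height) arguments are illustrative assumptions and are not taken from easyVMAF.

```python
# Hypothetical helper (not part of easyVMAF): build a -lavfi string that
# scales only the inputs whose resolution differs from what the model expects.
MODEL_RESOLUTION = {
    "HD": (1920, 1080),
    "Phone": (1920, 1080),
    "4K": (3840, 2160),
}


def scale_filtergraph(model, distorted_size, reference_size, model_path):
    """Return a filtergraph for libvmaf with bicubic scaling where needed."""
    width, height = MODEL_RESOLUTION[model]
    chains = []   # one scale chain per input that needs resizing
    labels = []   # the pads that are finally fed to libvmaf
    inputs = (("distorted", distorted_size), ("reference", reference_size))
    for index, (name, size) in enumerate(inputs):
        if size == (width, height):
            labels.append(f"[{index}:v]")
        else:
            chains.append(
                f"[{index}:v]scale={width}:{height}:flags=bicubic[{name}]"
            )
            labels.append(f"[{name}]")
    chains.append("".join(labels) + f"libvmaf=model_path={model_path}")
    return ";".join(chains)
```

For a 1280x720 distorted video and a 1920x1080 reference, the helper scales only the distorted input, which mirrors the first command in this section. The input sizes are assumed to be known already, for example from FFprobe.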

Frame Synchronization

VMAF requires frame alignment between the reference and the distorted videos, and so you have to guarantee that the frame rates, scan mode, and video durations match.

Here is how you can achieve frame-alignment.

Scan Mode Mismatch

H.264/AVC live sources are traditionally set to interlaced scan mode, but once they go through an OTT transcoder, the output is typically progressive. So, if we want to compute VMAF, we first need to normalize the scan mode.

Given that the VMAF models were trained with progressive scan mode, it is suggested to always deinterlace the interlaced inputs.

The normalization of the scan mode could be done by FFmpeg using the yadif filter:

ffmpeg -i <distorted> -i <reference> -lavfi "[1:v]yadif=0:-1:0[ref];[0:v][ref]libvmaf=model_path=/usr/local/share/model/vmaf_v0.6.1.pkl" -f null -

The above command line assumes that the reference stream is interlaced, so it is passed to the yadif filter with the options mode:parity:deint = 0:-1:0. This means:

0: Output one frame per each frame in the input.
-1: Enable automatic detection of field parity.
0: Deinterlace all frames.

So if the filter input had a frame rate of 29.97i with interlaced scan mode, the yadif filter will output 29.97p in progressive mode.

This is the most typical conversion method for interlaced sources, but you can also experiment with other options. For example, yadif=1:-1:0 will produce a 59.94p output for the same input.
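The effect of the yadif mode on the output frame rate can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical, and it only models the rate arithmetic (modes 0 and 2 emit one frame per input frame, modes 1 and 3 emit one frame per field), not the deinterlacing itself.

```python
from fractions import Fraction

# yadif modes: 0/2 output one frame per input frame,
# 1/3 output one frame per field (i.e. twice as many frames).
FRAME_MODES = {0, 2}
FIELD_MODES = {1, 3}


def yadif_output_rate(input_rate, mode=0):
    """Return the frame rate produced by yadif for a given input frame rate."""
    if mode in FRAME_MODES:
        return input_rate
    if mode in FIELD_MODES:
        return input_rate * 2
    raise ValueError(f"unknown yadif mode: {mode}")


ntsc = Fraction(30000, 1001)                # 29.97 frames per second
same_rate = yadif_output_rate(ntsc, 0)      # stays at 30000/1001
double_rate = yadif_output_rate(ntsc, 1)    # becomes 60000/1001
```

Whichever mode you pick, the resulting rate has to match the frame rate of the other input, otherwise the frame rate mismatch discussed next applies.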

Frame Rate Mismatch

First, you need to know that VMAF was not trained to deal with frame rate conversion issues, so here we will force the inputs to artificially work with VMAF.

Accordingly, the scores should be used cautiously and not as fully reliable values.

However, it can still be useful to compute this biased score because, in practice, frame rate conversion is pretty common in ABR ladders.

Here again, to force frame rate conversion we are going to use another FFmpeg filter:

ffmpeg -i <distorted> -i <reference> -lavfi "[0:v]fps=fps=30[distorted];[distorted][1:v]libvmaf=model_path=/usr/local/share/model/vmaf_v0.6.1.pkl" -f null -

The fps filter sets the frame rate by duplicating frames or dropping them until the desired value is reached.

We prefer to leave the reference unmodified and apply the filter to only the distorted video.
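As a rough illustration of that preference, the following Python sketch adds an fps filter to the distorted input only when its frame rate differs from the reference. The function name is hypothetical, and the rates are assumed to be strings such as "30000/1001" or "30/1", which is the form FFprobe reports in its r_frame_rate field.

```python
from fractions import Fraction


def fps_filtergraph(distorted_rate, reference_rate, model_path):
    """Convert the distorted input to the reference frame rate if they differ."""
    target = Fraction(reference_rate)
    if Fraction(distorted_rate) == target:
        # Rates already match: feed both inputs to libvmaf untouched.
        return f"[0:v][1:v]libvmaf=model_path={model_path}"
    # Rates differ: duplicate or drop frames in the distorted input only.
    return (
        f"[0:v]fps=fps={target}[distorted];"
        f"[distorted][1:v]libvmaf=model_path={model_path}"
    )
```

Comparing the rates as fractions avoids treating 29.97 and 30000/1001 as different values because of rounding.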

First Frame Mismatch

Sometimes, the reference video may start at a different frame than the distorted sequence.

This mismatch in the “starting-frame” will result in drift between the two sequences and lead to wrong VMAF scores.

To correctly compute the VMAF scores, we need to align the first frames of both the reference and distorted videos. And one way to do this is by trimming the misaligned video sequence until we achieve frame-alignment.

We can trim the videos using FFmpeg’s trim filter and guarantee that the output will contain a continuous subset of the input.

But first, we need to determine the starting point and duration of the subset of frames that match in both the reference and distorted video sequences. The challenge here is to find the right starting point that we should pass to the trim filter.

To solve this, we propose a PSNR-based approach that iteratively computes the PSNR between the distorted and reference videos until a match is found.

Here are the details:

First, we extract two video samples consisting of the first m-frames of the distorted and reference video sequences and compute the PSNR between them. This is the PSNR for the first iteration. We suggest that you use only a small number of frames (i.e., m) to reduce computational costs.
In the second iteration, we slide the distorted subsample forward by one frame and compute the PSNR again.
This process of "sliding and computing the PSNR" is repeated n times, where n is the number of video frames that fit in the SyncWindow. The SyncWindow is the time span in which we search for the right sync value.
If we are lucky (that is, if the SyncWindow was chosen correctly), at the end of the process the best PSNR will stand out at some i-th iteration. That means that the i-th frame of the distorted sequence matches the 1st frame of the reference.

At the end of this process, we have enough information to apply the trim filter.

The iterations previously described are shown in the following figures. On each iteration (i) the PSNR is computed between the reference_subsample and the distorted_subsample_i.

In practice, the PSNR calculation at each iteration can be done as follows –

     while <distorted_subsample_i.1st_frame> IN <SYNC_WINDOW>:
         getPSNR(reference_subsample, distorted_subsample_i)
         # slide the distorted subsample forward by one frame
         <distorted_subsample_i>.next_frame()

And the getPSNR() function can be implemented with FFmpeg as follows.

 ffmpeg  -i <distorted> -i <reference> \
 -lavfi "[0:v]trim=start=<OFFSET>:duration=<M>,setpts=PTS-STARTPTS[distorted_subsample];\
 [1:v]trim=start=0:duration=<M>,setpts=PTS-STARTPTS[reference_subsample];\
 [distorted_subsample][reference_subsample]psnr=stats_file=psnr.log" -f null -

where,

 <OFFSET>: (i-1)/fps, where i is the iteration number and fps is the frame rate in frames per second.
 <M>: Size in seconds of the subsample sequence. This is a fixed value for all the iterations.
 <SYNC_WINDOW>: window time (seconds) in which we want to find the right sync value

To give you an example, if we choose a <SYNC_WINDOW> value = 0.3 seconds for video sequences with fps = 30, we would get something as follows –

 iteration    offset(s)                psnr[dB]
   1          0.0                      21.098356
   2          0.03333333333333333      21.132783
   3          0.06666666666666667      21.167991
   4          0.1                      21.204151
   5          0.13333333333333333      21.248292
   6          0.16666666666666666      21.29118
  *7          0.2                      33.675342
   8          0.23333333333333334      21.363845
   9          0.26666666666666666      21.409776
  10          0.3                      21.451546

Based on these values, the best PSNR was found on the 7th iteration.

So the 7th frame of the distorted sequence (located 0.2 seconds from its start) matches the 1st frame of the reference sequence.
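The whole search can be sketched in Python as follows. This is a simplified illustration rather than easyVMAF's actual code: all function names are hypothetical, the window is treated as inclusive of its end point, and each iteration is scored by averaging the per-frame psnr_avg values that FFmpeg's psnr filter writes to its stats file.

```python
import re
import subprocess


def sync_offsets(sync_window, fps):
    """Candidate offsets (seconds), one per frame inside the sync window."""
    frames = int(round(sync_window * fps))
    return [i / fps for i in range(frames + 1)]


def psnr_command(distorted, reference, offset, m, log_path="psnr.log"):
    """FFmpeg arguments comparing m seconds of both inputs at one offset."""
    graph = (
        f"[0:v]trim=start={offset}:duration={m},"
        "setpts=PTS-STARTPTS[distorted_subsample];"
        f"[1:v]trim=start=0:duration={m},"
        "setpts=PTS-STARTPTS[reference_subsample];"
        f"[distorted_subsample][reference_subsample]psnr=stats_file={log_path}"
    )
    return ["ffmpeg", "-i", distorted, "-i", reference,
            "-lavfi", graph, "-f", "null", "-"]


def mean_psnr(log_text):
    """Average the per-frame psnr_avg values found in a psnr stats file."""
    values = [float(v) for v in re.findall(r"psnr_avg:(\S+)", log_text)]
    return sum(values) / len(values)


def get_psnr(distorted, reference, offset, m, log_path="psnr.log"):
    """Run FFmpeg once and return the mean PSNR for this offset."""
    command = psnr_command(distorted, reference, offset, m, log_path)
    subprocess.run(command, check=True, capture_output=True)
    with open(log_path) as log_file:
        return mean_psnr(log_file.read())


def find_sync_offset(distorted, reference, sync_window, fps, m,
                     psnr_fn=get_psnr):
    """Return (offset, psnr) for the offset with the highest PSNR."""
    scores = [(offset, psnr_fn(distorted, reference, offset, m))
              for offset in sync_offsets(sync_window, fps)]
    return max(scores, key=lambda score: score[1])
```

With a 0.3-second window at 30 fps, sync_offsets yields the ten offsets listed in the table, and find_sync_offset returns the offset with the highest PSNR, which is the value to use as <OFFSET> when trimming. The psnr_fn parameter exists only so the search logic can be exercised without running FFmpeg.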

With this information, we can trim the sequence to compute VMAF using FFmpeg.

ffmpeg -i <distorted> -i <reference> -lavfi "[0:v]trim=start=<OFFSET>,setpts=PTS-STARTPTS[distorted];[distorted][1:v]libvmaf=model_path=/usr/local/share/model/vmaf_v0.6.1.pkl" -f null -

In practice, we sometimes also need to pass the duration parameter to trim (trim=start=<OFFSET>:duration=<LENGTH>) to guarantee that the distorted and reference sequences have the same length in seconds.
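One way to choose that length is to take the longest span that both sequences can cover once the distorted input has been trimmed. The sketch below assumes the two durations (in seconds) are already known, for example from FFprobe; the function name and parameters are hypothetical.

```python
def trim_filtergraph(offset, distorted_duration, reference_duration,
                     model_path):
    """Trim both inputs to the same length after aligning the first frames."""
    # The distorted input loses `offset` seconds at its start, so the common
    # length is limited by whichever sequence ends first.
    length = min(distorted_duration - offset, reference_duration)
    if length <= 0:
        raise ValueError("offset leaves no overlapping frames")
    return (
        f"[0:v]trim=start={offset}:duration={length},"
        "setpts=PTS-STARTPTS[distorted];"
        f"[1:v]trim=start=0:duration={length},"
        "setpts=PTS-STARTPTS[reference];"
        f"[distorted][reference]libvmaf=model_path={model_path}"
    )
```

Trimming the reference as well keeps both sequences at the same duration even when the distorted video is the longer of the two.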

Putting It All Together using easyVMAF

Until now, we discussed the process of pre-processing your videos before computing VMAF using a few simple examples.

However, in practice, you often need to apply all or most of these normalizations at the same time, i.e., upscale the video, deinterlace it, change the frame rate (for example, from 29.97 to 30 fps), and synchronize the distorted and reference frames in time.

Instead of manually performing them, the entire process is automated in easyVMAF, a Python script to complete the normalization processes required by VMAF.

easyVMAF uses FFmpeg and FFprobe for all the necessary video editing and information gathering. It can perform deinterlacing, up/downscaling, frame synchronization, and frame rate adaptation.
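To make the combination concrete, here is a rough Python sketch that chains the filters discussed in the previous sections for each input. It is not easyVMAF's implementation: the function names, the parameters, and the order of the filters are assumptions chosen for illustration only.

```python
def input_chain(interlaced=False, fps=None, size=None, offset=0.0):
    """Comma-separated filter chain normalizing one input for libvmaf."""
    filters = []
    if interlaced:
        filters.append("yadif=0:-1:0")             # deinterlace
    if fps is not None:
        filters.append(f"fps=fps={fps}")           # frame rate adaptation
    if size is not None:
        width, height = size
        filters.append(f"scale={width}:{height}:flags=bicubic")
    if offset > 0:
        filters.append(f"trim=start={offset}")     # first-frame alignment
    filters.append("setpts=PTS-STARTPTS")          # restart timestamps at 0
    return ",".join(filters)


def vmaf_filtergraph(distorted_chain, reference_chain, model_path):
    """Join both per-input chains and feed them to libvmaf."""
    return (
        f"[0:v]{distorted_chain}[distorted];"
        f"[1:v]{reference_chain}[reference];"
        f"[distorted][reference]libvmaf=model_path={model_path}"
    )
```

For instance, an interlaced reference could use input_chain(interlaced=True), while a 720p distorted video at 29.97 fps that starts 0.2 seconds early could use input_chain(fps="30", size=(1920, 1080), offset=0.2).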

The next diagram shows a high-level overview of easyVMAF.

If you are curious about easyVMAF, please go to the GitHub repo where you can check out a Docker image or browse through the source code.

Please check it out and try it today on your videos!

If you have any suggestions, improvements, or you want to contribute, feel free to submit PRs.

Before we end this article, here is the procedure for running the Docker image and an explanation of the commandline parameters. Thank you!

 docker run --rm  gfdavila/easyVMAF -h
 usage: easyVMAF [-h] -d D -r R [-sw SW] [-ss SS] [-subsample N] [-reverse]
                 [-model MODEL] [-phone] [-verbose] [-output_fmt OUTPUT_FMT]
 Script to easily compute VMAF using FFmpeg. It allows to deinterlace, scale, and sync Ref and Distorted video samples automatically:                         
       Autodeinterlace: If the Reference or Distorted samples are interlaced, deinterlacing is applied                        
       Autoscale: Reference and Distorted samples are scaled automatically to 1920x1080 or 3840x2160 depending on the VMAF model to use                        
       Autosync: The first frames of the distorted video are used as a reference to do sync lookup with the Reference video.                         
            The sync is doing by a frame-by-frame lookup of the best PSNR                        
 See [-reverse] for more options for syncing                        
 As output, a json file with the VMAF score is created
 Optional arguments:
   -h, --help            show this help message and exit
   -sw SW                Sync Window: window size in seconds of a subsample of the Reference video. The sync lookup will be done between the first frames of the Distorted input and this Subsample of the Reference. (default=0. No sync).
   -ss SS                Sync Start Time. Time in seconds from the beginning of the Reference video to which the Sync Window will be applied from. (default=0).
   -subsample N          Specifies the subsampling of frames to speed up calculation. (default=1, None).
   -reverse              If enabled, it Changes the default Autosync behaviour: The first frames of the Reference video are used as reference to sync with the Distorted one. (Default = Disable).
   -model MODEL          VMAF Model. Options: HD, HDneg*, 4K. (Default: HD).
   -phone                It enables VMAF phone models (HD only). (Default=disable).
   -verbose              Activate verbose loglevel. (Default: info).
   -output_fmt OUTPUT_FMT
                         Output VMAF file format. Options: json or xml (Default: json)
 required arguments:
   -d D                  Distorted video
   -r R                  Reference video 
 * NOTE: HDneg is a VMAF experimental feature not supported yet by FFmpeg.

Gabriel Dávila Revelo

Gabriel Dávila Revelo is a Sales Engineer at Bitmovin. His interests focus on media and video technologies and on cloud-based video service architectures. He received his M.S. in Telecommunication Engineering from Buenos Aires University.