An InterFrame is a frame in a video compression stream that is expressed in terms of one or more neighboring frames. InterFrames are well suited for use in 3D configurator projects. The "inter" part of the term refers to the use of inter-frame prediction. This type of prediction exploits the temporal redundancy between neighboring frames to achieve higher compression rates.

What is an InterFrame?

Inter-frame prediction.

An inter-coded frame is divided into blocks called macroblocks. Then, instead of directly encoding the raw pixel values of each block, the encoder tries to find a similar block in a previously encoded frame, called a reference frame. This search is performed by a block matching algorithm. If the encoder succeeds, the block can be encoded by a vector, called the motion vector, which points to the position of the matching block in the reference frame. The process of determining the motion vector is called motion estimation.
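The block matching step can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD). This is only a minimal illustration; the function name and the search window size are our own choices, not taken from any codec specification:

```python
import numpy as np

def block_match(ref, cur_block, top, left, search_range=4):
    """Full-search block matching: find the offset (dy, dx) from (top, left)
    in the reference frame `ref` whose block best matches `cur_block`
    under the sum of absolute differences (SAD)."""
    h, w = cur_block.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y+h, x:x+w].astype(int) - cur_block.astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Real encoders use faster search patterns (diamond, hexagon) instead of this exhaustive scan, but the objective is the same.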

In most cases the encoder will succeed, but the block it finds will probably not match the block it is encoding exactly. For this reason, the encoder computes the differences between them. These residual values are known as the prediction error; they are transformed and sent to the decoder.

In summary, if the encoder succeeds in finding a matching block on a reference frame, it sends a motion vector pointing to the matched block together with the prediction error. Using both elements, the decoder can restore the raw pixels of the block.
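Leaving out the transform stage, the round trip can be sketched like this (a simplification; real codecs transform and quantize the residual before sending it, so the reconstruction is normally only approximate):

```python
import numpy as np

def encode_block(cur_block, ref, mv, top, left):
    """Encoder side: compute the prediction error (residual) of the block at
    (top, left) with respect to the motion-compensated reference block."""
    h, w = cur_block.shape
    y, x = top + mv[0], left + mv[1]
    pred = ref[y:y+h, x:x+w].astype(int)
    return cur_block.astype(int) - pred  # residual, sent alongside the motion vector

def decode_block(residual, ref, mv, top, left):
    """Decoder side: restore the raw pixels from motion vector + residual."""
    h, w = residual.shape
    y, x = top + mv[0], left + mv[1]
    return ref[y:y+h, x:x+w].astype(int) + residual
```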

This type of prediction has some advantages and disadvantages:

  • If everything works out, the algorithm finds a matching block with a small prediction error, so that the total size of the motion vector plus the prediction error (after transformation) is smaller than the size of a raw encoding.
  • If the block matching algorithm fails to find a suitable match, the prediction error is considerable. The total size of the motion vector plus the prediction error is then larger than the raw encoding. In this case, the encoder makes an exception and sends a raw encoding for that specific block.
  • If the matched block in the reference frame was itself encoded with inter-frame prediction, its coding errors are propagated to the current block. If every frame were encoded using this technique, there would be no way for a decoder to synchronize with a video stream, since it would be impossible to obtain the reference images.
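The fallback in the second case amounts to a simple cost comparison inside the encoder. A toy sketch (the function and all bit counts in the test are made up for illustration; real encoders use rate-distortion optimization):

```python
def choose_mode(raw_bits, mv_bits, residual_bits):
    """Toy mode decision: use the motion-compensated (inter) encoding only
    when motion vector + residual is cheaper than coding the block raw."""
    inter_bits = mv_bits + residual_bits
    return "inter" if inter_bits < raw_bits else "raw"
```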

Due to these disadvantages, a reliable and temporally periodic reference frame must be used for this technique to be efficient and useful. This reference frame is called an IntraFrame, which is strictly intra-coded so that it can always be decoded without additional information.

In most designs there are two types of InterFrames: P-frames and B-frames. These two frame types and the I-frames (intra-coded pictures) are usually combined into a GOP (Group of Pictures). The I-frame does not require any additional information to be decoded and can therefore be used as a reliable reference. This structure also makes it possible to achieve the periodic I-frames required for decoder synchronization.

Frame Types.

The difference between P-frames and B-frames lies in the reference frames they are allowed to use.


P-frame is the term for forward-predicted pictures. The prediction is made from an earlier picture, usually an I-frame, so that less coding data is needed (around 50% of the I-frame size).

The data required for this prediction consists of motion vectors and transform coefficients that describe the prediction correction. This is what is meant by motion compensation.


B-frame is the term for bidirectionally predicted pictures. This prediction method requires even less coding data than P-frames (around 25% of the I-frame size), because a B-frame can be predicted or interpolated from both an earlier and a later frame. Like P-frames, B-frames are expressed as motion vectors and transform coefficients. To avoid growing propagation errors, most coding standards do not use B-frames as a reference for further predictions. In newer coding methods (e.g., H.264/AVC), however, B-frames may be used as a reference.
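In the simplest case, the bidirectional prediction is just the rounded average of a past and a future reference block. A minimal sketch (equal weights assumed; H.264 also allows unequal, explicitly signaled weights):

```python
import numpy as np

def bipredict(past_block, future_block):
    """B-frame style prediction: rounded average of a block from an earlier
    reference frame and a block from a later reference frame."""
    return (past_block.astype(int) + future_block.astype(int) + 1) // 2
```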

Typical Group of Pictures (GOP) structure.

A typical Group of Pictures (GOP) structure is IBBPBBP…. The I-frame is used to predict the first P-frame, and these two frames are also used to predict the first and second B-frames. The second P-frame is predicted from the first P-frame, and both P-frames together are used to predict the third and fourth B-frames.
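Because each pair of B-frames depends on the reference frame that follows them, the frames must be transmitted out of display order: the later reference has to arrive first. A sketch of this reordering (the frame labels are illustrative, not from any standard):

```python
def decode_order(display_order):
    """Reorder an I/B/P display sequence for transmission: each I- or
    P-frame that pending B-frames depend on must be sent before them."""
    out, pending_b = [], []
    for f in display_order:
        if f[0] == "B":
            pending_b.append(f)   # hold B-frames until their future reference is sent
        else:
            out.append(f)         # I- or P-frame: a reference, send immediately
            out.extend(pending_b)
            pending_b = []
    return out + pending_b        # any trailing B-frames are appended as-is
```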

This structure exposes a problem: the fourth frame (a P-frame) is needed to predict the second and third frames (B-frames). So the P-frame must be transmitted before the B-frames, which delays the transmission and requires buffering the P-frame. Still, this structure has its strengths:

  • It minimizes the problem of possible uncovered areas.
  • P and B frames require less data than I frames, so less data is transferred.

But it has vulnerabilities:

  • It increases the complexity of the decoder, which can mean that more memory is needed to rearrange the frames.
  • The interpolated frames (namely B-frames) need more motion vectors, which means an increased bit rate.

H.264 – Improvements in inter-frame prediction.

The most important improvements of H.264 in this technique, compared to previous standards, are:

  • More flexible block partitioning
  • Motion compensation with a resolution of up to ¼ pixel
  • Multiple reference frames
  • Improved Direct/Skip macroblock modes

More flexible block partitioning.

Luminance block partitions of 16×16 (the only size in MPEG-2), 16×8, 8×16 and 8×8. In the last case, each 8×8 block can be further divided into blocks of 8×4, 4×8 or 4×4.

The frame to be encoded is divided into equally sized blocks, as shown in the figure above. The prediction for each block is a block of the same size from a reference picture, offset by a small displacement.
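Finer partitions adapt better to complex motion, but each sub-block carries its own motion vector, which costs extra bits. A small helper illustrating the trade-off (our own illustration, not codec code):

```python
def motion_vectors_per_macroblock(partition, sub_partition=None):
    """Number of motion vectors a 16x16 macroblock carries for a given
    split, e.g. (16, 8) -> 2, or (8, 8) with (4, 4) sub-blocks -> 16."""
    w, h = partition
    blocks = (16 // w) * (16 // h)
    if partition == (8, 8) and sub_partition is not None:
        sw, sh = sub_partition
        blocks *= (8 // sw) * (8 // sh)  # each 8x8 block is split further
    return blocks
```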

Motion compensation with a resolution of up to ¼ pixel.

Pixels at half-pixel positions are obtained by applying a filter of length 6:

H=[1 -5 20 20 -5 1]

For instance:

b=A – 5B + 20C + 20D – 5E + F

Pixels at quarter-pixel positions are obtained by bilinear interpolation.

While MPEG-2 allowed a resolution of ½ pixel, H.264 allows up to ¼ pixel. This means we can interpolate pixels that do not exist in the reference frames in order to find blocks that match the current block even better. If the motion vector is an integer number of sampling units, the motion-compensated block is found directly in the reference picture. If it is not an integer, the prediction is obtained from interpolated pixels, applying the interpolation filter in the horizontal and vertical directions.
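The interpolation above can be turned into code. One detail the formula omits: the 6-tap sum has a gain of 32 (1 − 5 + 20 + 20 − 5 + 1), so the result is rounded, divided by 32, and clipped to the 8-bit range, as H.264 does. The variable names follow the example formula above:

```python
def half_pel(A, B, C, D, E, F):
    """Half-pixel sample from six neighboring integer-position pixels
    using the 6-tap filter H = [1 -5 20 20 -5 1], normalized by the
    filter gain of 32 with rounding, then clipped to [0, 255]."""
    acc = A - 5 * B + 20 * C + 20 * D - 5 * E + F
    return min(255, max(0, (acc + 16) >> 5))

def quarter_pel(p, q):
    """Quarter-pixel sample: rounded bilinear average of two neighboring
    samples (integer- or half-pixel positions)."""
    return (p + q + 1) >> 1
```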

Multiple references.

Multiple references for motion estimation make it possible to find the best reference in two possible buffers (list 0 holding past pictures, list 1 holding future pictures), each containing up to 16 pictures. The block prediction is a weighted sum of blocks from the reference pictures. This improves image quality in scenes with fades, zooming, or newly uncovered objects.
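A sketch of the weighted sum of one block from each list (the weights here are free parameters of the sketch; H.264 either signals them explicitly or derives them implicitly):

```python
import numpy as np

def weighted_bipred(block_l0, block_l1, w0=0.5, w1=0.5):
    """Weighted sum of a block from list 0 (past reference) and a block
    from list 1 (future reference); unequal weights help with fades."""
    pred = w0 * block_l0.astype(float) + w1 * block_l1.astype(float)
    return np.clip(np.rint(pred), 0, 255).astype(np.uint8)
```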

Improved Direct/Skip Macroblock.

Skip and Direct modes are used very frequently, especially for B-frames, and significantly reduce the number of bits to encode. In these modes, a block is encoded without sending residual errors or motion vectors; the encoder only records that the block is a Skip block. The decoder derives the motion vector of a Direct/Skip-coded block from other, already decoded blocks.

There are two possible ways to derive the motion vector:


Temporal: the motion vector is derived from the co-located block in the first frame of reference list 1. That list 1 block references a block in list 0, and its motion vector is reused (scaled by the temporal distance) for the current block.


Spatial: the motion vector is predicted from neighboring macroblocks in the same frame. A possible criterion is to copy the motion vector of an adjacent block. These modes are used in uniform regions of the picture where there is not much motion.

Additional information.

Although the term “frame” is common in informal use, in many cases (e.g., in the international video coding standards from MPEG and VCEG) the more general term “picture” is used instead, where a picture can be either a complete frame or a single interlaced field.

Video codecs such as MPEG-2, H.264 or Ogg Theora reduce the amount of data in a stream by following key frames with one or more inter frames. These frames can typically be encoded at a lower bit rate than key frames because much of the picture stays similar from frame to frame, so only the changing parts need to be encoded.

If you have any questions or suggestions, please feel free to contact our experts in our forum.

Thank you very much for your visit.