Moritz Schmidt

Reputation: 2813

OpenCV Camera Calibration mathematical background

I just started working with OpenCV and calibrated two cameras. I am calibrating the cameras in Python using a chessboard, with the functions drawChessboardCorners and calibrateCamera. Everything works fine.

The documentation of these functions describes how to use them, but I have been wondering what the magic behind them is: the mathematical background of camera calibration in OpenCV.

How does OpenCV detect the corners of the chessboard?

How exactly is this used to calibrate the camera?

Upvotes: 2

Views: 1377

Answers (2)

Paul92

Reputation: 9062

In order to understand what camera calibration actually is, let's start with the way the images are formed.

A camera is basically a device which performs a transformation (known as a projection) of points from 3D space into a 2D space (the image space). In the analysis of image formation, we quite often use what is known as the pinhole camera model, where images are formed like this:

[figure: image formation in a pinhole camera]

A bit more diagrammatically, we can see the image formation like this:

[figure: pinhole camera geometry, showing image plane Y1, optical axis X3, focal length f, 3D point P and its image point Q]

Here, Y1 is the image plane, x3 is the distance from the camera to the object (which we will call z, the depth), and x1 is the displacement along the X1 axis of the 3D point P from the optical axis X3 of the camera. O is the camera with focal length f, and y1 is the distance between the center of the image and the pixel Q corresponding to the point P.

The simplest projection model is known as orthography. This model simply drops the depth coordinate of the 3D point (and possibly scales it). So, if we start from a point P in the 3D world

P = (x1, x2, x3)^T

we can write the projection as:

(y1, y2)^T = s * Π * (x1, x2, x3)^T = s * (x1, x2)^T,   with   Π = [1 0 0; 0 1 0]

where s is a real scaling factor and the matrix Π is the projection matrix.

This model is a good approximation for telephoto lenses (long focal lengths) and for objects that are shallow with respect to their distance from the camera. It is exact only for telecentric lenses. A more accurate model for the cameras we actually use is the perspective projection. To get an intuition, objects appear bigger in the image plane when the 3D object is closer to the camera. A bit more mathematically, due to triangle similarity, y1 is proportional to x1. The proportionality factor is f/x3, or f/z. Letting f be 1 for the time being, this leads to the following projection function:

(y1, y2)^T = (x1/x3, x2/x3)^T

As you see, the projection can't be represented as a matrix multiplication any more, since it's not a linear transformation. Which is not ideal - matrix multiplications have very nice properties. So, we introduce a trick known as homogeneous coordinates. For each point, we add an extra coordinate (so 2D points are now represented using 3 coordinates and 3D points using 4 coordinates), and we keep that last coordinate normalized to 1 (think of an implicit division by the last coordinate).

Now, our point P becomes:

P̃ = (x1, x2, x3, 1)^T

and we can write a perspective projection matrix as:

ỹ = Π_p * P̃,   with   Π_p = [1 0 0 0; 0 1 0 0; 0 0 1 0],   so   ỹ = (x1, x2, x3)^T → (x1/x3, x2/x3)^T

where the last division happens "implicitly" due to our usage of homogeneous coordinates, and the tilde indicates a vector in homogeneous coordinates.

And there you have it! That is the perspective projection matrix. And note that this is a non-invertible transformation.
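To make the implicit division concrete, here is a minimal NumPy sketch (the point coordinates are made up) of projecting a 3D point with the perspective projection matrix in homogeneous coordinates:

```python
import numpy as np

# Hypothetical 3D point P = (x1, x2, x3) in camera coordinates (x3 is the depth).
P = np.array([2.0, 1.0, 4.0])
P_h = np.append(P, 1.0)                    # homogeneous coordinates: (x1, x2, x3, 1)

# Perspective projection matrix (with f = 1), as above.
Pi_p = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])

y_h = Pi_p @ P_h                           # homogeneous image point: (x1, x2, x3)
y = y_h[:2] / y_h[2]                       # the "implicit" division by the last coordinate

print(y)                                   # [0.5  0.25]
```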

However, cameras not only project the 3D points onto a 2D image plane. After the projection, they perform a transformation into the discrete image (pixel) space. This is represented by a matrix known as the intrinsic camera matrix K:

K = [fx s cx; 0 fy cy; 0 0 1]

where fx and fy are the independent focal lengths along the x and y axes (which it is usually reasonable to assume are equal), s is a skew factor that accounts for the two image axes not being exactly perpendicular to each other (in modern cameras it is close to 0), and cx, cy represent the origin of the image coordinates (usually close to the center of the image).
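Continuing the sketch with made-up intrinsics, K maps the normalised image point from the previous example to pixel coordinates:

```python
import numpy as np

# Made-up intrinsics: focal lengths in pixels, zero skew,
# principal point at the centre of a 640x480 image.
fx, fy, s = 800.0, 800.0, 0.0
cx, cy = 320.0, 240.0

K = np.array([[fx,   s, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Normalised image point (x1/x3, x2/x3, 1) from the projection sketch above.
y_norm = np.array([0.5, 0.25, 1.0])

u, v, _ = K @ y_norm                       # pixel coordinates
print(u, v)                                # 720.0 440.0
```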

Cameras usually add some distortion to the image, and there are different mathematical models for them.
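As an illustration, the radial part of the polynomial distortion model commonly used (OpenCV's default model is of this Brown-Conrady type) scales a normalised point by a factor that depends on its distance from the image centre; the coefficients below are made up:

```python
def radial_distort(x, y, k1, k2, k3=0.0):
    """Apply the radial part of a polynomial (Brown-Conrady style) distortion
    model to a normalised image point (x, y)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    return x * factor, y * factor

# Made-up coefficients describing mild barrel distortion.
print(radial_distort(0.5, 0.25, k1=-0.1, k2=0.01))
```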

The camera calibration process refers to determining the intrinsic camera matrix and the parameters for the distortion models.

This can be done by the following rough process:

  • Start from multiple images from different perspectives of a model of known layout and size
  • For each image, determine some points of known correspondence. These are usually corners, since they can easily and reliably be matched against each other.
  • Each view of the target has an associated homography matrix H = l * K * (R|T), where l is a real scaling factor, K is the intrinsic camera matrix and (R|T) is a matrix representing the camera rotation and translation in 3D space (this is known as the extrinsic camera matrix).
  • Based on the point correspondences and using the homographies, there are closed-form solutions that determine the intrinsic camera parameters. Without modeling distortions, at least 3 images are required; assuming the skew is 0, at least 2 images are enough. In practice, more images lead to more accurate results.
  • Once the intrinsics are known, the extrinsic parameters (rotation and translation) for each view can be computed.
  • Distortions can also be estimated.
  • Some procedures use an optimization across all images to further refine the results from the closed form solution.
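In OpenCV, this whole process is wrapped in calibrateCamera. A minimal sketch of the pipeline from the question might look like this (the 9x6 pattern size, square size and file names are assumptions):

```python
import glob

import cv2
import numpy as np

pattern_size = (9, 6)            # inner corners of the chessboard (assumed)
square_size = 1.0                # in arbitrary units, e.g. one chessboard square

# 3D coordinates of the chessboard corners in the board's own frame (z = 0).
board_points = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
board_points[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
board_points *= square_size

obj_points, img_points = [], []
for fname in glob.glob("calib_*.png"):      # hypothetical file names
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(board_points)
        img_points.append(corners)

# Closed-form initialisation plus non-linear refinement happen inside this call.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

print("reprojection error:", rms)
print("intrinsic matrix K:\n", K)
print("distortion coefficients:", dist.ravel())
# rvecs/tvecs are the extrinsic parameters (rotation and translation) per view.
```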

To see the actual closed form equations, have a look at a very nice paper by W. Burger (Burger, 2016).

The "bible" of this field is Multiple View Geometry in Computer Vision, by R. Hartley and A. Zisserman.

Upvotes: 1

varankou

Reputation: 901

Refer to "Learning OpenCV" by Gary Bradski and Adrian Kaehler. There is a large chapter on camera calibration with a good mathematical background on the pinhole camera model, various distortions and ways to reduce them (through calibration).

Corners in the image are detected using its second derivatives.
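As one illustration of derivative-based corner detection in OpenCV (the Harris detector; not necessarily the exact routine used internally by findChessboardCorners), a minimal call looks like this:

```python
import cv2
import numpy as np

gray = cv2.imread("chessboard.png", cv2.IMREAD_GRAYSCALE)   # hypothetical image
gray_f = np.float32(gray)

# Harris response, built from image derivatives over a small neighbourhood.
response = cv2.cornerHarris(gray_f, blockSize=2, ksize=3, k=0.04)

# Keep the strongest responses as corner candidates.
corners = np.argwhere(response > 0.01 * response.max())
print(len(corners), "corner candidates")
```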

Upvotes: 1
