
Capture Rectilinear RGBD Data with iPhone

A rectilinear RGBD image is useful for many computer vision tasks, such as 3D reconstruction. This tutorial illustrates how to capture RGBD data with an iPhone, export the data to Python, and rectify it. The code in this blog is available in this repository.

Capturing a Photo with a Depth Map

iOS devices with a dual camera or a TrueDepth camera can provide depth data when capturing an image. However, the depth data is only provided when requested. This section focuses on capturing a photo with depth data using AVFoundation.

If you don't yet know how to take a photo with AVFoundation, other tutorials such as Capturing Still and Live Photos will help. Briefly speaking, to capture an image with AVFoundation, a capture session needs to be built and configured.

captureSession = AVCaptureSession()

The configuration includes the capture device configuration, input configuration, output configuration and handling, preview output, and capture settings. By adjusting some of these configurations, the depth data will be provided along with the capture result. The rest of this section focuses on configuring such a session.

Capture Device Configuration

According to Apple's documentation, builtInDualCamera and builtInTrueDepthCamera are able to provide photos with depth maps. To make it possible for your pipeline to output a depth map, you need to select one of these two cameras when instantiating your AVCaptureDevice.

Note: Although both cameras can output depth maps, they are based on different principles. builtInDualCamera derives depth using binocular stereo vision: it matches feature points between the images captured by the two cameras and triangulates the depth. This approach is easily disturbed: dark scenes or textureless surfaces cause problems for it. builtInTrueDepthCamera, on the other hand, is a structured light system, like the Kinect. It projects infrared dots onto the object, then analyzes the pattern of the projected dots to infer depth. In brief, depth maps provided by builtInTrueDepthCamera are much more accurate than those provided by builtInDualCamera.

imageCaptureDevice = AVCaptureDevice.default(
    .builtInTrueDepthCamera,
    for: .depthData,
    position: .unspecified
)

Input / Output Configuration

The following code configures the input and output of the capture session. Whenever you change the configuration, remember to call beginConfiguration() before and commitConfiguration() afterward.

deviceInput = try! AVCaptureDeviceInput(device: imageCaptureDevice)
captureSession.beginConfiguration()
captureSession.sessionPreset = .photo
captureSession.addInput(deviceInput)
captureSession.commitConfiguration()

By setting the photo output's isDepthDataDeliveryEnabled to true, you are one step closer to getting the depth map.

captureSession.beginConfiguration()
photoOutput = AVCapturePhotoOutput()
captureSession.addOutput(photoOutput)
captureSession.commitConfiguration()
photoOutput.isDepthDataDeliveryEnabled = true
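
Not every device or configuration supports depth delivery, and isDepthDataDeliverySupported only reports true after the output has been added to a session with a depth-capable device. A minimal sketch of guarding the flag:

// Guard depth support before enabling delivery; fall back gracefully otherwise.
if photoOutput.isDepthDataDeliverySupported {
    photoOutput.isDepthDataDeliveryEnabled = true
} else {
    // Depth is unavailable with this device/configuration; capture a plain photo instead.
}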

Video Preview Setup

Theoretically, you could start capturing images with this configuration alone. But no one wants a camera that shows nothing while they are taking a photo. Thus, we need to set up a preview layer for the session so that people can see what will be captured before they press the button. You can instantiate the preview layer for the session with the following code.

previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
previewLayer.videoGravity = .resizeAspectFill

Then, by inserting the layer into a UIView object and adjusting its frame property, you can make the preview layer visible.

override func viewDidLoad() {
    super.viewDidLoad()
    // ...
    previewContainerView.layer.insertSublayer(previewLayer, at: 0)
}

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    // ...
    previewLayer.frame = previewContainerView.bounds
}

Running and Stopping the Session

Running an AVCaptureSession is resource-intensive. By starting and stopping the session at the right times, you avoid wasting the phone's battery. After starting the session, you should see the preview working.

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    // ...
    captureSession.startRunning()
}

override func viewWillDisappear(_ animated: Bool) {
    super.viewWillDisappear(animated)
    // ...
    captureSession.stopRunning()
}
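
One caveat: startRunning() and stopRunning() are blocking calls, so Apple recommends calling them off the main thread. A minimal sketch, assuming a dedicated serial queue stored as a property (the name sessionQueue is mine):

let sessionQueue = DispatchQueue(label: "capture.session.queue")

// In viewWillAppear / viewWillDisappear, dispatch the blocking call off the main thread.
sessionQueue.async { [weak self] in
    self?.captureSession.startRunning()
}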

Capturing the RGBD Data

When the user (or, more likely, you) presses the capture button, it's time to tell the capture session to capture an image. This step requires an AVCapturePhotoSettings parameter that specifies the settings for this capture. Setting isDepthDataDeliveryEnabled to true makes the capture session pass you the depth data. The raw depth data contains many holes where depth is not available; setting isDepthDataFiltered to true tells the system to fill those holes. Of course, you can set it to false and choose your own algorithm to handle the holes. Just mind how those holes are represented.

let photoSettings = AVCapturePhotoSettings()
photoSettings.isDepthDataDeliveryEnabled = true
photoSettings.isDepthDataFiltered = true
photoOutput.capturePhoto(with: photoSettings, delegate: self)

You get the result of the capture by conforming to AVCapturePhotoCaptureDelegate. The processed result is passed to photoOutput(_:didFinishProcessingPhoto:error:).

func photoOutput(
    _ output: AVCapturePhotoOutput,
    didFinishProcessingPhoto photo: AVCapturePhoto,
    error: Error?
) {
    // Result handling
}

Data Type Conversion and Export

You can now capture an image with a depth map. However, the data is still represented in Apple's own formats. If, like me, you prefer working with other tools such as Python, exporting the captured data is a good idea.

Export the Image

Compared with the depth map, the image itself is not hard to obtain. By calling fileDataRepresentation() or cgImageRepresentation()!.takeUnretainedValue() on the AVCapturePhoto object, you can access the JPEG / HEIC data or the CGImage object of the captured image. Then the data is ready to be shared: you can save it to the user's photo library, present a UIActivityViewController, or post it to a server.
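
As one possible route, here is a minimal sketch that writes the encoded photo data to a temporary file and hands it to UIActivityViewController; it assumes this runs inside the view controller, that photo is the AVCapturePhoto from the delegate callback, and the file name is arbitrary:

// Write the encoded photo to a temporary file and share it.
if let imageData = photo.fileDataRepresentation() {
    let imageURL = FileManager.default.temporaryDirectory
        .appendingPathComponent("capture.jpg")
    try? imageData.write(to: imageURL)
    let activityController = UIActivityViewController(
        activityItems: [imageURL],
        applicationActivities: nil
    )
    present(activityController, animated: true)
}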

Export the Depth Map

Compared with the colored image, the depth map is a bit harder to handle, since it's not actually an image but a 2-dimensional array. The AVCapturePhoto's depthData property stores the depth-related data, and the depthDataMap property of AVDepthData contains the depth map itself, represented as a CVPixelBuffer object. There are some other properties of AVDepthData that help us rectify and reconstruct the RGBD data, and they should also be recorded.
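
As a quick orientation, here is a minimal sketch of navigating these properties inside the delegate callback (the property names come from AVFoundation; only the local variable names are mine):

// From the capture result to the depth map and the calibration data.
guard let depthData = photo.depthData,
      let calibration = depthData.cameraCalibrationData else { return }
// depthDataMap is a CVPixelBuffer holding the per-pixel depth (or disparity) values.
let depthMap = depthData.depthDataMap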

Convert Depth Map

My original thought was to export the depth map directly to a JSON file. Unluckily, CVPixelBuffer does not conform to the Codable protocol. Another approach is to read the values one by one, write them into another array, and then wrap that into a JSON file. This sounds a little clumsy, but it works.

The implementation gets the base address of the CVPixelBuffer, casts the bits to Float32, and reads the values one by one. Before casting, you need the following line to convert the data in the pixel buffer to 32-bit depth values. Otherwise, the data might be 16-bit, and it could contain disparity values instead of depth.

let convertedDepthMap = photo.depthData!.converting(
    toDepthDataType: kCVPixelFormatType_DepthFloat32
).depthDataMap

And here is the implementation that converts a CVPixelBuffer object containing depth data into a Float32 array.

func convertDepthData(depthMap: CVPixelBuffer) -> [[Float32]] {
    let width = CVPixelBufferGetWidth(depthMap)
    let height = CVPixelBufferGetHeight(depthMap)
    var convertedDepthMap: [[Float32]] = Array(
        repeating: Array(repeating: 0, count: width),
        count: height
    )
    // Lock the buffer before reading its base address, and unlock it afterwards.
    CVPixelBufferLockBaseAddress(depthMap, CVPixelBufferLockFlags(rawValue: 2))
    let floatBuffer = unsafeBitCast(
        CVPixelBufferGetBaseAddress(depthMap),
        to: UnsafeMutablePointer<Float32>.self
    )
    for row in 0 ..< height {
        for col in 0 ..< width {
            convertedDepthMap[row][col] = floatBuffer[width * row + col]
        }
    }
    CVPixelBufferUnlockBaseAddress(depthMap, CVPixelBufferLockFlags(rawValue: 2))
    return convertedDepthMap
}

Convert Calibration Data

The camera calibration data provides important calibration parameters. Here are the fields I'm interested in.

  • intrinsicMatrix: This matrix provides the optical center and the focal length of the camera.
  • pixelSize: The size of a pixel. This is useful since the values in intrinsicMatrix are expressed in pixels.
  • intrinsicMatrixReferenceDimensions: Since the depth map and the colored image have different sizes, it's useful to know which dimensions the intrinsic values refer to, and to convert the related data if necessary.
  • lensDistortionCenter: Useful when correcting the lens distortion.
  • lensDistortionLookupTable: Provides the necessary magnification values when rectifying a distorted image.
  • inverseLensDistortionLookupTable: Provides the necessary magnification values when re-distorting a rectified image.

All of the above properties except lensDistortionLookupTable and inverseLensDistortionLookupTable are straightforward to convert to float or float-array values. The latter two are of type Data, but they actually store an array of Float and can be converted with the following method.

func convertLensDistortionLookupTable(lookupTable: Data) -> [Float] {
    let tableLength = lookupTable.count / MemoryLayout<Float>.size
    var floatArray: [Float] = Array(repeating: 0, count: tableLength)
    _ = floatArray.withUnsafeMutableBytes{ lookupTable.copyBytes(to: $0) }
    return floatArray
}

Data Wrapping

With the help of JSONSerialization, it's possible to wrap the depth map and calibration data into a Data object that decodes as a JSON string. In Swift, you can write the Data object directly to disk and get its URL; how you then share the file is up to you. Here's a sample implementation of the data wrapper.

func wrapEstimateImageData(
    depthMap: CVPixelBuffer,
    calibration: AVCameraCalibrationData
) -> Data {
    let jsonDict: [String : Any] = [
        "calibration_data" : [
            "intrinsic_matrix" : (0 ..< 3).map{ x in
                (0 ..< 3).map{ y in calibration.intrinsicMatrix[x][y] }
            },
            "pixel_size" : calibration.pixelSize,
            "intrinsic_matrix_reference_dimensions" : [
                calibration.intrinsicMatrixReferenceDimensions.width,
                calibration.intrinsicMatrixReferenceDimensions.height
            ],
            "lens_distortion_center" : [
                calibration.lensDistortionCenter.x,
                calibration.lensDistortionCenter.y
            ],
            "lens_distortion_lookup_table" : convertLensDistortionLookupTable(
                lookupTable: calibration.lensDistortionLookupTable!
            ),
            "inverse_lens_distortion_lookup_table" : convertLensDistortionLookupTable(
                lookupTable: calibration.inverseLensDistortionLookupTable!
            )
        ],
        "depth_data" : convertDepthData(depthMap: depthMap)
    ]
    let jsonStringData = try! JSONSerialization.data(
        withJSONObject: jsonDict,
        options: .prettyPrinted
    )
    return jsonStringData
}
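
As mentioned above, the wrapped Data can be written straight to disk. A minimal sketch, assuming depthMap and calibration are the depthDataMap and cameraCalibrationData obtained from the capture result, and that the file name is arbitrary:

// Persist the wrapped JSON to a temporary file so it can be shared or uploaded.
let depthDataURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("depth_data.json")
let wrappedData = wrapEstimateImageData(depthMap: depthMap, calibration: calibration)
try? wrappedData.write(to: depthDataURL)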

Displaying in Jupyter Notebook

I chose to POST the image as a .jpg file and the wrapped depth data as a .json file to my server. Jupyter Notebook is one of the easiest ways to inspect the data you've got: you can load the colored image and the depth map as numpy.array and visualize them with matplotlib.pyplot.

import json

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

with open('/path/to/file.json') as in_file:
    json_content = json.loads(in_file.read())
depth_map = np.array(json_content['depth_data']).astype('float32')
plt.imshow(depth_map)
plt.colorbar()

image = np.array(Image.open('/path/to/file.jpg'))
plt.imshow(image)

[Figures: the colored image and the visualized depth map]

Rectify Distorted RGBD Data

Now we have the colored image and the depth map, as well as the necessary camera intrinsics. However, one issue remains: distortion. The colored image is distorted because it was not taken by a pinhole camera: the lens distorts the image as it gathers more light. The depth map, in order to match the colored image, is also distorted by Apple on purpose. To improve accuracy on computer vision tasks, both of them should be rectilinear, that is, undistorted. Luckily, Apple provides the data needed to undistort or re-distort the image: lensDistortionLookupTable and inverseLensDistortionLookupTable.

For a more detailed explanation, refer to the WWDC 17 session Capturing Depth in iPhone Photography. In brief, the lookup table provides the magnification along the radius of the image: the line from the distortion center to the farthest corner of the image. To compensate for the distortion, we can use the lookup table to determine the original coordinates of each pixel and put it in the right position.

Given lensDistortionLookupTable, this C method calculates the distorted position of a given point. It determines the radius of the image and the distance between the provided point and the distortion center (radius_point), computes the magnification by linearly interpolating values in the lookup table, and finally derives the position after distortion using that magnification. Pass inverseLensDistortionLookupTable instead to do the inverse calculation.

#include <math.h>
#include <stdlib.h>
#include <string.h>

double* get_lens_distortion_point(
    int* point,
    double* lookup_table,
    int lookup_table_len,
    double* distortion_center,
    int* image_size
) {
    double radius_max_x = distortion_center[0];
    double radius_max_y = distortion_center[1];
    if (image_size[0] - radius_max_x > radius_max_x) {
        radius_max_x = image_size[0] - radius_max_x;
    }
    if (image_size[1] - radius_max_y > radius_max_y) {
        radius_max_y = image_size[1] - radius_max_y;
    }
    double radius_max = sqrt(pow(radius_max_x, 2) + pow(radius_max_y, 2));

    double radius_point_x = point[0] - distortion_center[0];
    double radius_point_y = point[1] - distortion_center[1];
    double radius_point = sqrt(pow(radius_point_x, 2) + pow(radius_point_y, 2));

    double magnification = lookup_table[lookup_table_len - 1];
    if (radius_point < radius_max) {
        double relative_position = radius_point / radius_max * (lookup_table_len - 1);
        double frac = relative_position - (int)floor(relative_position);
        double lower_lookup = lookup_table[(int)floor(relative_position)];
        double upper_lookup = lookup_table[(int)ceil(relative_position)];
        magnification = lower_lookup * (1.0 - frac) + upper_lookup * frac;
    }
    double* mapped_point = (double*)malloc(sizeof(double) * 2);
    mapped_point[0] = distortion_center[0] + radius_point_x * (1.0 + magnification);
    mapped_point[1] = distortion_center[1] + radius_point_y * (1.0 + magnification);
    return mapped_point;
}

To rectify the whole image, create a new empty image; for each pixel in the new image, calculate the distorted position of that point, then fill the pixel with the value at the distorted position in the distorted image. This is a time-consuming step since you need to iterate over all pixels in the image, which is why this method is implemented in C rather than Python or Swift. The implementation is as follows.

double* rectify_image(
    double* image,
    int width,
    int height,
    int channel,
    double* lookup_table,
    int lookup_table_len,
    double* distortion_center
) {
    int image_size[2] = {width, height};
    // calloc zero-initializes the output, so pixels that map outside the
    // original image stay at zero instead of containing garbage.
    double* rectified_image = (double*)calloc(width * height * channel, sizeof(double));
    for (int i = 0; i < width; i ++) {
        for (int j = 0; j < height; j ++) {
            int rectified_index[2] = {i, j};
            double* original_index = get_lens_distortion_point(
                rectified_index,
                lookup_table,
                lookup_table_len,
                distortion_center,
                image_size
            );
            int original_i = (int)original_index[0];
            int original_j = (int)original_index[1];
            free(original_index);
            if (original_i < 0 || original_i >= width ||
                original_j < 0 || original_j >= height) {
                continue;
            }
            memcpy(
                rectified_image + (i * height + j) * channel,
                image + (original_i * height + original_j) * channel,
                channel * sizeof(double)
            );
        }
    }
    return rectified_image;
}

Besides, you will want to write another method to free the allocated memory, since the returned buffer is allocated inside rectify_image but never freed there.

void free_double_pointer(double* ptr) {
    free(ptr);
}

Supposing your .c file is named undistort.c, the following command compiles it into a .so file, which lets you bridge these methods to Python using ctypes.

cc -fPIC -shared -o undistort.so undistort.c

Finally, it's time to wrap the C method into a Python method. ctypes will help you convert the data types when passing arguments. Though Python is dynamically typed, you have to be extremely careful with data types here since you are communicating with C. The numpy array is converted before being passed as an argument because the image may not be of type double (for instance, the default dtype for the colored image is uint8).

import ctypes
import functools

import numpy as np

undistort_dll = ctypes.CDLL('/path/to/file.so')

def rectify_image_c(image, lookup_table, distortion_center):
    c_rectify_image = undistort_dll.rectify_image
    c_rectify_image.restype = ctypes.POINTER(
        ctypes.c_double * functools.reduce(lambda x, y: x * y, image.shape)
    )
    c_free_double_pointer = undistort_dll.free_double_pointer
    c_free_double_pointer.restype = None
    channel = 1 if len(image.shape) < 3 else image.shape[2]
    original_datatype = image.dtype
    # The C code expects a contiguous buffer of doubles.
    image = np.ascontiguousarray(image.astype('double'))
    raw_result = c_rectify_image(
        image.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        image.shape[0],
        image.shape[1],
        channel,
        (ctypes.c_double * len(lookup_table))(*lookup_table),
        len(lookup_table),
        (ctypes.c_double * 2)(*distortion_center)
    ).contents
    reshaped_result = np.reshape(raw_result, image.shape).astype(original_datatype)
    c_free_double_pointer(raw_result)
    return reshaped_result

By calling the Python method and passing the image as a numpy array along with lensDistortionLookupTable, you can rectify the image. Note that lensDistortionCenter is expressed relative to intrinsicMatrixReferenceDimensions; if your image size differs, remap the position of lensDistortionCenter accordingly, as in the sketch below.
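
For instance, here is a hedged sketch of feeding the exported JSON (with the keys produced by wrapEstimateImageData above) into rectify_image_c, rescaling the distortion center to the depth map's resolution; depending on the orientation of your capture, the two center components may need to be swapped:

calibration = json_content['calibration_data']
ref_width, ref_height = calibration['intrinsic_matrix_reference_dimensions']
center_x, center_y = calibration['lens_distortion_center']
# Rescale the distortion center; swap the components if your depth map's
# axes are transposed relative to the reference dimensions.
scaled_center = [
    center_x * depth_map.shape[1] / ref_width,
    center_y * depth_map.shape[0] / ref_height
]
rectified_depth = rectify_image_c(
    depth_map,
    calibration['lens_distortion_lookup_table'],
    scaled_center
)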

A GIF makes it easier to compare the original image and the rectified image.

[GIF: comparison between the original and the rectified image]

Visualization

By making use of the rectified RGBD image as well as the camera calibration data, we can project the image into 3D space as a point cloud. This is easy to implement: you only need the optical center coordinates and the focal length.

def _get_3d_coordinate(row, col, fl, oc_x, oc_y, depth):
    return np.array([(row - oc_x) * depth / fl, (col - oc_y) * depth / fl, depth])

By using open3d or pptk, you will be able to visualize the point cloud.
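
Here is a minimal sketch using open3d; it assumes rectified_depth is the rectified depth map, and that fl, oc_x, and oc_y are the focal length and optical center already rescaled to the depth map's resolution:

import numpy as np
import open3d as o3d

# Back-project every pixel with _get_3d_coordinate and display the cloud.
points = np.array([
    _get_3d_coordinate(row, col, fl, oc_x, oc_y, rectified_depth[row, col])
    for row in range(rectified_depth.shape[0])
    for col in range(rectified_depth.shape[1])
])
point_cloud = o3d.geometry.PointCloud()
point_cloud.points = o3d.utility.Vector3dVector(points)
o3d.visualization.draw_geometries([point_cloud])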

[Figure: the point cloud projected from the rectified RGBD data]