Ever wondered how holographic projections were used to deliver messages in Star Wars? Or how, in Iron Man, Tony Stark works in his hi-tech interactive laboratory, designing his super-cool suits and weapons? In both movies, virtual projections interact with real-world characters. This is augmented reality (AR).
AR’s application in everyday life is on the rise. For example, when you take a photo of an unknown object on your mobile phone, there are applications that can tell you what that object is. How does this happen?
For this to happen, something must interact with the real world and serve as a bridge to the processor. This is usually a camera.
This is where AR comes in. Some highly advanced systems built recently can view the real world as people do, calculate the distance to each object around the system, help people build real-world objects with 3D printing from the perceived data, and work out where the person holding the device stands and in what position, among other things. Three examples of such systems are Google Glass, Meta Augmented Reality glasses and Oculus Rift. This article explores the hardware in these systems that acts as the primary source of interaction with the real world.
Whenever you take a photo using a mobile phone or a digital single-lens reflex (DSLR) camera, the image that is formed is in 2D. The camera itself does not understand the scene in 3D, as people do, because the scene is captured using a single camera.
To perceive depth, or a sense of 3D, in the way people do, two cameras are required. This is achieved with a specialised camera called a stereo camera.
How the stereo camera works
The basic principle of a stereo camera is binocular vision, the same principle the human brain applies to perceive its surroundings. Human beings have two eyes separated by some distance, called the interocular distance. Each retina produces a slightly different image of the same object. This is the primary reason why we are able to see things in 3D. With a single eye, judging depth would be far more difficult.
Consider Fig. 1, where a small notebook, when looked at using both eyes, looks perfectly rectangular with an accurate sense of depth. But if the same notebook is held at arm’s length and viewed using, say, the left eye alone, the middle part and the extreme-left edge are visible. Similarly, when viewed using the right eye alone, the right edge is visible. When the notebook is viewed using both eyes, the brain stitches together the images from both retinas and renders a 3D view of the same.
Note that this is why, in 3D cinemas, pictures and characters viewed without 3D glasses appear misaligned, as two offset copies in different colours (cyan and red). This essentially explains the science behind the working of a stereo camera.
Constructing an image
Based on the above principle, a stereo camera is designed to perceive depth using two cameras. These cameras are separated by a distance called the baseline, which plays the same role as the interocular distance. Usually this distance is anywhere between 55mm and 85mm.
At its core, a camera is an image sensor that converts light into electrical signals. These signals are then digitised (analogue-to-digital conversion), so that each captured frame is converted into a stream of bits.
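To get a feel for how large this stream of bits is, the arithmetic can be sketched as below. The resolution, bit depth and frame rate are hypothetical values chosen purely for illustration, not figures from any sensor mentioned in this article, and the calculation assumes uncompressed frames.

```python
def frame_size_bits(width, height, bits_per_pixel):
    """Number of bits in one uncompressed frame."""
    return width * height * bits_per_pixel


def stream_rate_bits_per_second(width, height, bits_per_pixel, frames_per_second):
    """Raw data rate of an uncompressed video stream, in bits per second."""
    return frame_size_bits(width, height, bits_per_pixel) * frames_per_second


# Hypothetical sensor: 1280 x 720 pixels, 10 bits per pixel, 30 frames per second
bits_per_frame = frame_size_bits(1280, 720, 10)
bits_per_second = stream_rate_bits_per_second(1280, 720, 10, 30)

print(bits_per_frame)    # 9216000
print(bits_per_second)   # 276480000
```

A stereo camera has two such sensors, so under the same assumptions the processor would have to take in roughly twice this rate.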
There are many ways in which these bits are delivered from the image sensor. A few examples are the parallel interface, low-voltage differential signalling (LVDS), MIPI and HiSPi.
In a stereo camera, there are two streams of data bits originating from the two cameras. This is where the complexity arises. Two image sensors focus on a single object, and one might expect them to produce exactly the same output, that is, the same stream of bits. But since the two cameras are separated by some distance, there is a slight difference between the two outputs even though both cameras focus on the same object. Fig. 2 shows such a pair of images captured using a stereo camera.
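The slight difference between the two outputs can be measured as a horizontal shift in pixels, commonly called disparity. The following is a minimal sketch of one common way to find it, block matching with a sum of absolute differences, applied to a single row of pixels. It is not necessarily the method used by any particular stereo camera. The function name, window size and pixel values are made up for illustration, and the two rows are assumed to be already aligned so that a feature shifts only horizontally.

```python
def row_disparity(left_row, right_row, x, half_window=2, max_disparity=8):
    """Estimate the disparity of pixel x in the left row.

    A small window around x in the left row is compared against windows in
    the right row shifted by 0..max_disparity pixels. The shift with the
    lowest sum of absolute differences (SAD) is returned.
    """
    if x - half_window < 0 or x + half_window >= len(left_row):
        raise ValueError("window around x does not fit inside the row")

    best_shift, best_cost = 0, float("inf")
    for shift in range(max_disparity + 1):
        if x - shift - half_window < 0:
            break  # shifted window would fall off the left edge
        cost = 0
        for k in range(-half_window, half_window + 1):
            cost += abs(left_row[x + k] - right_row[x - shift + k])
        if cost < best_cost:
            best_cost, best_shift = cost, shift
    return best_shift


# Made-up pixel rows: the same bright feature appears 3 pixels further
# to the right in the left image than in the right image.
right_row = [10, 10, 10, 10, 50, 90, 200, 90, 50, 10, 10, 10, 10, 10, 10, 10]
left_row = [10, 10, 10, 10, 10, 10, 10, 50, 90, 200, 90, 50, 10, 10, 10, 10]

print(row_disparity(left_row, right_row, x=9))  # 3
```

Nearby objects shift more between the two views than distant ones, which is why this shift carries depth information.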
These two different streams of output are then fed to a processor that accepts stereoscopic inputs. Note that not all processors available in the market support stereo input. The processor takes the difference between the two images and, using depth-perceiving algorithms, calculates the depth, that is, the distance at which a particular object is placed. This is the simple working principle of a stereo camera, which is used in a system like Oculus Rift, whose complete setup is shown in Fig. 3.
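As a rough illustration of the depth calculation, the textbook relation for two identical, parallel cameras is depth = focal length × baseline / disparity, where disparity is the horizontal shift in pixels of the same point between the two images. The sketch below assumes this idealised model. The focal length and disparity values are hypothetical, and real depth-perceiving algorithms involve calibration and matching steps that are not shown here.

```python
def depth_from_disparity(focal_length_px, baseline_mm, disparity_px):
    """Depth of a point in millimetres for an idealised parallel stereo pair.

    focal_length_px: focal length expressed in pixels (assumed known from calibration)
    baseline_mm:     distance between the two cameras in millimetres
    disparity_px:    horizontal shift of the point between the two images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_length_px * baseline_mm / disparity_px


# Hypothetical values: 700-pixel focal length, 65mm baseline
print(depth_from_disparity(700, 65, 35))  # 1300.0 (mm)
print(depth_from_disparity(700, 65, 70))  # 650.0 (mm)
```

Doubling the disparity halves the computed depth, which matches the intuition that nearer objects appear more displaced between the two views.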
While stereo cameras are a great source of 3D images and videos, they are not the only source used for 3D recreation in AR. Another interesting way of recreating 3D images uses a concept called time of flight (ToF).