Image Formation
Carlo Tomasi
The images we process in computer vision are formed by light bouncing off surfaces in the world and
into the lens of the system. The light then hits an array of sensors inside the camera. Each sensor produces
electric charges that are read by an electronic circuit and converted to voltages. These are in turn sampled by
a device called a digitizer (or analog-to-digital converter) to produce the numbers that computers eventually
process, called pixel values. Thus, the pixel values are a rather indirect encoding of the physical properties
of visible surfaces.
Is it not amazing that all those numbers in an image file carry information on how the properties of
a packet of photons were changed by bouncing off a surface in the world? Even more amazing is that
from this information we can perceive shapes and colors. Although we are used to these notions nowadays,
the discovery of how images form, say, on our retinas, is rather recent. In ancient Greece, Euclid, in 300
B.C., attributed sight to the action of rectilinear rays issuing from the observer’s eye, a theory that remained
prevalent until the early seventeenth Century, when Johannes Kepler explained image formation as we understand
it now. In Euclid’s view, then, the eye is an active participant in the visual process: not a receptor, but
an agent that reaches out to apprehend its object. One of Euclid’s postulates on vision maintained that any
given object can be removed to a distance from which it will no longer be visible because it falls between
adjacent visual rays. This is ray tracing in a very concrete, physical sense!
Today, we know that image formation works the other way around, with light leaving surfaces in the
world, passing through a lens, and hitting receptive elements that are sensitive to light. The study of this
process can be divided into what happens up to the point when light hits the sensor, and what happens
thereafter. The first part occurs in the realm of optics, the second is a matter of electronics. We will look at
the geometry of optics first and at what is called sensing (the electronic part) thereafter.
1 Optics
A camera projects light from surfaces onto a two-dimensional sensor. Our idealized model for the geometry
of this projection is the so-called pinhole camera model, which is briefly described next. All rays in this
model, as we will see, go through a small hole, and therefore form a star of lines. Real lenses behave
differently from these idealized models, and we will see the main differences below.
1.1 The Pinhole Camera
A pinhole camera is a box with five opaque faces and a translucent one (Figure 1(a)). A very small hole
is punched in the face of the box opposite to the translucent face. If you consider a single point in the
world, such as the tip of the candle flame in the figure, only a thin beam from that point enters the pinhole
and hits the translucent screen. Thus, the pinhole acts as a selector of light rays: without the pinhole and
the box, any point on the screen would be illuminated from a whole hemisphere of directions, yielding a
uniform coloring. With the pinhole, on the other hand, an inverted image of the visible world is formed on
the screen. When the pinhole is reduced to a single point, this image is formed by the star of rays through
the pinhole, intersected by the plane of the screen. Of course, a pinhole reduced to a point is an idealization:
no power would pass through such a pinhole, and the image would be infinitely dim (black).
Figure 1: (a) Projection geometry for a pinhole camera. (b) If a screen could be placed in front of the
pinhole, rather than behind, without blocking the projection rays, then the image would be upside-up.
The fact that the image on the screen is inverted is mathematically inconvenient. It is therefore customary
to consider instead the intersection of the star of rays through the pinhole with a plane parallel to the screen
and in front of the pinhole as shown in Figure 1(b). This is of course an even greater idealization, since a
screen in this position would block the light rays. The new image is isomorphic to the old one, but upside-up.
In this model, the pinhole is called more appropriately the center of projection. The front screen is the
image plane. The distance between center of projection and image plane is the focal distance. The optical
axis is the line through the center of projection that is perpendicular to the image plane. The point where the
optical axis pierces the sensor plane is the principal point.
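The following Python sketch makes the pinhole model concrete by projecting world points, given in camera coordinates with the Z axis along the optical axis, onto the image plane. The function name and the sample points are illustrative choices, not part of any standard library.

```python
import numpy as np

def pinhole_project(points_xyz, f, cx=0.0, cy=0.0):
    """Project 3D points, given in camera coordinates with Z along the optical
    axis, onto the image plane of an ideal pinhole camera.

    f        : focal distance (center of projection to image plane)
    (cx, cy) : principal point, where the optical axis pierces the image plane
    """
    points_xyz = np.asarray(points_xyz, dtype=float)
    X, Y, Z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    # Each projection ray through the center of projection meets the image
    # plane at (f X / Z, f Y / Z), offset by the principal point.
    return np.stack([f * X / Z + cx, f * Y / Z + cy], axis=1)

# The same point moved twice as far away projects twice as close to the
# principal point: perspective foreshortening.
print(pinhole_project([[0.1, 0.2, 1.0], [0.1, 0.2, 2.0]], f=0.05))
```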
1.2 Lenses and Discrepancies from the Pinhole Model
The pinhole camera is a useful and simple reference system for talking about the geometry of image for-
mation. As pointed out above, however, this device has a fundamental problem: if the pinhole is large,
the image is blurred, and if it is small, the image is dim. When the diameter of the pinhole tends to zero,
the image vanishes.1 For this reason, lenses are used instead. Ideally, a lens gathers a whole cone of light
from every point of a visible surface, and refocuses this cone onto a single point on the sensor. Unfortu-
nately, lenses only approximate the geometry of a pinhole camera. The most obvious discrepancies concern
focusing and distortion.
Focusing. Figure 2 (a) illustrates the geometry of image focus. In front of the camera lens2 there is a circular
diaphragm of adjustable diameter called the aperture. This aperture determines the width of the cone of rays
that hits the lens from any given point in the world.
1In fact, blurring cannot be reduced at will, because of diffraction limits.
2Or inside the block of lenses, depending on the design of the lens.
Figure 2: (a) If the image plane is at the correct focal distance (2), the lens focuses the entire cone of rays
that the aperture allows through the lens onto a single point on the image plane. If the image plane is either
too close (1) or too far (3) from the lens, the cone of rays from the candle tip intersects the image in a small
ellipse (approximately a circle), producing a blurred image of the candle tip. (b) Image taken with a large
aperture. Only a shallow range of depths is in focus. (c) Image taken with a small aperture. Everything is in
focus.
Consider for instance the tip of the candle flame in the Figure. If the image plane is at the wrong distance
(cases 1 and 3 in the Figure), the cone of rays from the candle tip intersects the image plane in an ellipse,
which for usual imaging geometries is very close to a circle. This is called the circle of confusion for that
point. When every point in the world projects onto a circle of confusion, the image appears to be blurred.
For the image of the candle tip to be sharply focused, it is necessary for the lens to funnel all of the rays
that the aperture allows through from that point onto a single point in the image. This condition is achieved
by changing the focal distance, that is, the distance between the lens and the image plane. By studying the
optics of light refraction through the lens, it can be shown that the further the point in the world, the shorter
the focal distance must be for sharp focusing. All distances are measured along the optical axis of the lens.
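This relationship is commonly summarized by the thin-lens equation 1/Z + 1/e = 1/f, where Z is the distance of the world point from the lens, e the focal distance, and f the focal length of the lens. The equation is not derived here; the short sketch below simply uses it, with an assumed 50-millimeter lens, to show that e decreases toward f as Z grows.

```python
def in_focus_distance(Z, f):
    """Focal distance e that sharply focuses a point at distance Z,
    assuming the thin-lens equation 1/Z + 1/e = 1/f."""
    return 1.0 / (1.0 / f - 1.0 / Z)

f = 0.05                        # an assumed 50 mm lens, in meters
for Z in [0.5, 1.0, 2.0, 10.0, float("inf")]:
    e = in_focus_distance(Z, f)
    print(f"Z = {Z:6} m  ->  focal distance e = {1000 * e:.2f} mm")
# The farther the point, the shorter the required focal distance,
# approaching the focal length f itself as Z goes to infinity.
```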
Since the correct focal distance depends on the distance of the world point from the lens, for any fixed
focal distance, only the points on a single plane in the world are in focus. An image plane in position 1 in
the Figure would focus points that are farther away than the candle, and an image plane in position 3 would
focus points that are closer by. The dependence of focus on distance is visible in Figure 2(b): the lens was
focused on the vertical, black and white stripe visible in the image, and the books that are closer are out
of focus. The books that are farther away are out of focus as well, but by a lesser amount, since the effect
of depth is not symmetric around the optimal focusing distance. Photographers say that the lens with the
settings in Figure 2(b) has a shallow (or narrow) depth of field.
The depth of field can be increased, that is, the effects of poor focusing can be reduced, by making
the lens aperture smaller. As a result, the cone of rays that hit the lens from any given point in the world
becomes narrower, the circle of confusion becomes smaller, and the image becomes more sharply focused
everywhere. This can be seen by comparing Figures 2 (b) and (c). Image (b) was taken with the lens aperture
opened at its greatest diameter, resulting in a shallow depth of field. Image (c), on the other hand, was taken
with the aperture closed down as much as possible for the given lens, resulting in a much greater depth of
field: all books are in focus to the human eye. The price paid for a sharper image was exposure time: a small
aperture lets little light through, so the imaging sensor had to be exposed longer to the incoming light: 1/8
of a second for image (b) and 5 seconds, forty times as long, for image (c).
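A rough calculation shows why stopping down the aperture shrinks the circle of confusion. Assuming the thin-lens model of the previous sketch, a point whose sharp image forms at distance e_point behind the lens, recorded on an image plane at distance e_sensor, is spread by similar triangles over a circle of diameter about a |e_sensor - e_point| / e_point, where a is the aperture diameter. The 50-millimeter lens and the one-meter focusing distance below are illustrative assumptions; only the two f-numbers come from Figure 2.

```python
def in_focus_distance(Z, f):
    """Focal distance at which a point at distance Z is sharply focused
    (thin-lens equation, as in the previous sketch)."""
    return 1.0 / (1.0 / f - 1.0 / Z)

def blur_circle_diameter(a, e_sensor, e_point):
    """Diameter of the circle of confusion for an aperture of diameter a:
    the cone of rays converging to a sharp image at e_point is cut by an
    image plane at e_sensor in a circle of this size (similar triangles)."""
    return a * abs(e_sensor - e_point) / e_point

f = 0.05                                 # assumed 50 mm lens
e_sensor = in_focus_distance(1.0, f)     # image plane placed to focus at 1 m
e_point = in_focus_distance(2.0, f)      # where a point 2 m away would focus
for n in (4.2, 29):                      # the two apertures of Figure 2
    a = f / n                            # aperture diameter from the f-number
    c = blur_circle_diameter(a, e_sensor, e_point)
    print(f"f/{n}: circle of confusion of about {1e6 * c:.0f} micrometers")
# Stopping down from f/4.2 to f/29 shrinks the circle by a factor of 29/4.2,
# about 7, which is why Figure 2 (c) looks sharp everywhere.
```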
The focal distance at which a given lens focuses objects at infinite distance from the camera is called
the rear focal length of the lens, or focal length for short.3 All distances are measured from the center of
the lens and along the optical axis. Note that the focal length is a lens property, which is usually printed
on the barrel of the lens. In contrast, the focal distance is the distance between lens and image plane that a
photographer selects to place a certain plane of the world in focus. So the focal distance varies even for the
same lens.4
In photography, the aperture is usually measured in stops, or f -numbers. For a focal length f , an aperture
of diameter a is said to have an f -number
n = f / a,
so a large aperture has a small f -number. To remind one of this fact, apertures are often denoted with the
notation f /n. For instance, the shallow depth of field image in Figure 2 (b) was obtained with a relatively
wide aperture f /4.2, while the greater depth of field of the image in Figure 2 (c) was achieved with a much
narrower aperture f /29.
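As a small numerical check of the relation n = f/a, the sketch below computes the aperture diameters corresponding to the two f-numbers above. The 50-millimeter focal length is an assumed value, since the exact zoom setting used for Figure 2 is not recorded here.

```python
# n = f / a: aperture diameters for the two f-numbers used in Figure 2.
f = 50e-3                 # focal length in meters (an assumed zoom setting)
for n in (4.2, 29):
    a = f / n             # a large aperture corresponds to a small f-number
    print(f"f/{n}: aperture diameter a = {1e3 * a:.1f} mm")
# The light gathered is proportional to the aperture area, hence to 1 / n**2:
# f/4.2 admits (29 / 4.2)**2, roughly 48, times as much light as f/29.
```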
Why use a wide aperture at all, if images can be made sharp with a small aperture? As was mentioned
3The front focal length is the converse: the distance to a world object that would be focused on an image plane at infinite distance from the lens.
4This has nothing to do with zooming. A zoom lens lets you change the focal length as well, that is, modify the optical properties of the lens.
Figure 3: A shallow depth of field draws attention to what is in focus, at the expense of what is not.
earlier, sharper images are darker, or require longer exposure times. In the example above, the ratio between
the areas of the apertures is (29/4.2)² ≈ 48. This is more or less consistent with the fact that the sharper
image required forty times the exposure of the blurrier one: 48 times the area means that the lens focuses
48 times as much light on any given small patch on the image, and the exposure time can be decreased
accordingly by a factor of 48. So, wide apertures are required for subjects that move very fast (for instance, in
sports photography). In these cases, long exposure times are not possible, as they would lead to motion blur,
a blur of a different origin (motion in the world) than poor focusing. Wide apertures are often aesthetically
desirable also for static subjects, as they attract attention to what is in focus, at the expense of what is not.
This is illustrated in Figure 3, from
http://www.hp.com/united-states/consumer/digital_photography/take_better_photos/tips/depth.html .
Distortion. Even the high quality lens5 used for the images in Figure 2 exhibits distortion. For instance,
if you place a ruler along the vertical edge of the blue book on the far left of the Figure, you will notice that
the edge is not straight. Curvature is visible also in the top shelf. This is geometric pincushion distortion.
This type of distortion, illustrated in Figure 4(b), moves every point in the image away from the principal
point, by an amount that is proportional to the square of the distance of the point from the principal point.
The reverse type of distortion is called barrel distortion, and draws image points closer to the principal point
by an amount proportional to the square of their distance from it. Because they move image points towards
or away from the principal point, both types of distortion are called radial. While non-radial distortion does
occur, it is typically negligible in common lenses, and is henceforth ignored.
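The radial model just described is easy to state in code: each point is displaced along the line joining it to the principal point by an amount k r², where r is its distance from the principal point and k is a distortion coefficient (positive for pincushion, negative for barrel). The function below is an illustrative sketch of this model, not a calibration routine; the coefficient values are made up.

```python
import numpy as np

def radial_distort(points_xy, k, principal_point=(0.0, 0.0)):
    """Displace each image point along its ray from the principal point by
    k * r**2, where r is its distance from the principal point:
    k > 0 pushes points outward (pincushion), k < 0 pulls them inward (barrel)."""
    p = np.asarray(points_xy, dtype=float) - principal_point
    r = np.linalg.norm(p, axis=1, keepdims=True)
    # Unit directions away from the principal point (which itself stays fixed).
    direction = np.divide(p, r, out=np.zeros_like(p), where=r > 0)
    return p + k * r**2 * direction + principal_point

# A straight horizontal row of points: pincushion bows it away from the center,
# barrel bows it toward the center, as in Figure 4 (b) and (c).
row = [(x, 0.4) for x in np.linspace(-1.0, 1.0, 5)]
print(radial_distort(row, k=+0.1))
print(radial_distort(row, k=-0.1))
```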
Distortion can be quite substantial, either by design (such as in non-perspective lenses like fisheye lenses)
or to keep the lens inexpensive and with a wide field of view. Accounting for distortion is crucial in computer
vision algorithms that use cameras as measuring devices, for instance, to reconstruct the three-dimensional
shape of objects from two or more images of them.
5Nikkor AF-S 18-135 zoom lens, used for both images (b) and (c).
Figure 4: (a) An undistorted grid. (b) The grid in (a) with pincushion distortion. (c) The grid in (a) with
barrel distortion.
Practical Aspects: Achieving Low Distortion. Low distortion can be obtained by mounting a lens designed
for a large sensor onto a camera with a smaller sensor. The latter only sees the central portion of the field
of view of the lens, where distortion is usually small.
For instance, lenses for the Nikon D200 used for Figure 2 are designed for a 23.6 by 15.8 millimeter
sensor. Distortion is small but not negligible (see Figure 2 (c)) at the boundaries of the image when a
sensor of this size is used. Distortion would be much smaller if the same lens were mounted onto a
camera with what is called a “1/2 inch” sensor, which is really 6.4 by 4.8 millimeters in size, because the
periphery of the lens would not be used. Lens manufacturers sell relatively inexpensive adaptors for this
purpose. The real price paid for this reduction of distortion is a concomitant reduction of the camera’s
field of view (more on this in the Section on sensing below).
2 Sensing
In a digital camera, still or video, the light that hits the image plane is collected by one or more sensors, that
is, rectangular arrays of sensing elements. Each element is called a pixel (for “picture element”). The finite
overall extent of the sensor array, together with the presence of diaphragms in the lens, limits the cone (or
pyramid) of directions from which light can reach pixels on the sensor. This cone is called the field of view
of the camera-lens combination.
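Under the pinhole approximation, a sensor of extent w placed at focal distance e from the center of projection spans an angle of 2 arctan(w / (2e)) in the corresponding direction. The sketch below applies this to the two sensor sizes quoted in the Practical Aspects note above; the 18-millimeter focal length is an assumed setting of the zoom lens mentioned earlier.

```python
import math

def field_of_view_deg(sensor_extent_m, focal_distance_m):
    """Angular field of view (degrees) along one sensor dimension, under the
    pinhole approximation: the sensor edge subtends 2 * atan(extent / (2 e))."""
    return math.degrees(2.0 * math.atan(sensor_extent_m / (2.0 * focal_distance_m)))

f = 18e-3   # assumed: the 18-135 mm zoom mentioned earlier, at its 18 mm end
sensors = {"23.6 x 15.8 mm SLR sensor": (23.6e-3, 15.8e-3),
           '"1/2 inch" (6.4 x 4.8 mm) sensor': (6.4e-3, 4.8e-3)}
for name, (w, h) in sensors.items():
    print(f"{name}: {field_of_view_deg(w, f):.0f} x {field_of_view_deg(h, f):.0f} degrees")
# The same lens on the smaller sensor sees a much narrower field of view,
# which is the price of the lower distortion noted in the Practical Aspects box.
```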
In the 21st Century, digital cameras have become pervasive in both the consumer and professional
markets as well as in computer vision research. SLR (Single-Lens Reflex) still cameras are the somewhat
bulkier cameras with an internal mirror that lets the photographer view the exact image that the sensor
will see once the shutter button is pressed (hence the name: a single lens with a mirror (reflex)). These have
larger sensors than CCTV cameras have, typically about 24 by 16 millimeters, although some very expensive
models have sensors as large as 36 by 24 millimeters. More modern CCTV cameras are similar to the old
ones, but produce a digital rather than analog signal directly. This signal is transferred to a computer through
a digital connection such as USB, or, for high-bandwidth video, IEEE 1394 (also known as Firewire), or a
Gigabit Ethernet connection.
The next Section describes how pixels convert light intensities into voltages, and how these are in turn
converted into numbers within the camera circuitry. This involves processes of integration (of light over the
sensitive portion of each pixel), sampling (of the integral over time and at each pixel location), and addition
of noise at all stages. These processes, as well as solutions for recording images in color, are then described
in turn.
2.1 Pixels
A pixel on a digital camera sensor is a small rectangle that contains a photosensitive element and some
circuitry. The photosensitive element is called a photodetector, or light detector. It is a semiconductor junc-
tion placed so that light from the camera lens can reach it. When a photon strikes the junction, it creates
an electron-hole pair with approximately 70 percent probability (this probability is called the quantum ef-
ficiency of the detector). If the junction is part of a polarized electric circuit, the electron moves towards
the positive pole and the hole moves towards the negative pole. This motion constitutes an electric current,
which in turn causes an accumulation of charge (one electron) in a capacitor. A separate circuit discharges
the capacitor at the beginning of the shutter (or exposure) interval. The charge accumulated over this interval
of time is proportional to the amount of light that struck the photodetector during exposure, and therefore to the
brightness of the part of the scene that the lens focuses on the pixel in question. Longer shutter times or
greater image brightness both translate to more accumulated charge, until the capacitor fills up completely
(“saturates”).
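The accumulation of charge just described can be caricatured in a few lines of code: photons arrive at random during the shutter interval, each frees an electron with probability equal to the quantum efficiency, and the capacitor stops accumulating once it is full. The photon rate and the full-well capacity below are made-up numbers, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def accumulated_electrons(photon_rate, exposure_s, quantum_efficiency=0.7,
                          full_well=30_000):
    """Electrons collected by one pixel during the shutter interval.

    photon_rate        : photons per second reaching the photodetector (made up)
    exposure_s         : shutter (exposure) time in seconds
    quantum_efficiency : probability that a photon frees an electron (about 0.7)
    full_well          : capacitor capacity in electrons (made-up value)
    """
    photons = rng.poisson(photon_rate * exposure_s)        # random photon arrivals
    electrons = rng.binomial(photons, quantum_efficiency)  # successful conversions
    return min(int(electrons), full_well)                  # saturation

# Longer shutter times (or a brighter scene) mean more charge, up to saturation.
for exposure in (0.01, 0.1, 1.0, 10.0):
    print(exposure, accumulated_electrons(photon_rate=50_000, exposure_s=exposure))
```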
Practical Aspects: CCD and CMOS Sensors. Two methods are commonly used in digital cameras to read
these capacitor charges: the CCD and the CMOS active sensor. The Charge-Coupled Device (CCD)
is an electronic, analog shift register, and there is typically one shift register for each column of a CCD
sensor. After the shutter interval has expired, the charges from all the pixels are transferred to the shift
registers of their respective array columns. These registers in turn feed in parallel into a single CCD
register at the bottom of the sensor, which transfers the charges out one row after the other as in a bucket
brigade. The voltage across the output capacitor of this circuitry is proportional to the brightness of the
corresponding pixel. An Analog-to-Digital (A/D) converter finally amplifies and transforms these voltages
to binary numbers for transmission. In some cameras, the A/D conversion occurs on the camera itself. In
others, a separate circuit (a frame grabber) is installed for this purpose on a computer that the camera
is connected to.
The photodetector in a CMOS camera works in principle in the same way. However, the photosensitive
junction is fabricated with the standard Complementary-symmetry Metal-Oxide-Semiconductor (CMOS)
technology used to make common integrated circuits such as computer memory and processing units.
Since photodetector and processing circuitry can be fabricated with the same process in CMOS sensors,
the charge-to-voltage conversion that CCD cameras perform serially at the output of the CCD shift register
can be done instead in parallel and locally at every pixel on a CMOS sensor. This is why CMOS arrays
are also called Active Pixel Sensors (APS).
Because of inherent fabrication variations, the first CMOS sensors used to be much less consistent in their
performance, both across different chips and from pixel to pixel on the same chip. This caused the voltage
measured for a constant brightness to vary, thereby producing poor images at the output. However, CMOS
sensor fabrication has improved dramatically in the recent past, and the two classes of sensors are now
comparable to each other in terms of image quality. Although CCDs are still used where consistency of
performance is of prime importance, CMOS sensors are eventually likely to supplant CCDs, both because
of their lower cost and because of the opportunity to add more and more processing to individual pixels.
For instance, “smart” CMOS pixels are being built that adapt their sensitivity to varying light conditions
and do so differently in different parts of the image.
Figure 5: Plot of the normalized gamma correction curve for γ = 1.6.
2.2 A Simple Sensor Model
Not all of the area dedicated to a pixel is necessarily photosensitive, as part of it is occupied by circuitry. The
fraction of pixel area that collects light that can be converted to current is called the pixel’s fill factor, and is
expressed in percent. A 100 percent fill factor is achievable by covering each pixel with a properly shaped
droplet of silica (glass) or silicon. This droplet acts as a micro-lens that funnels photons
from the entire pixel area onto the photo-detector. Not all cameras have micro-lenses, nor does a micro-
lens necessarily work effectively on the entire pixel area. So different cameras can have very different fill
factors. In the end, the voltage output from a pixel is the result of integrating light intensity over a pixel area
determined by the fill factor.
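As a toy account of this integration, the energy collected by a pixel scales with the irradiance reaching it, the exposure time, and the photosensitive fraction of its area. The pixel size, fill factors, and irradiance in the sketch below are illustrative values, not specifications of any particular sensor.

```python
def collected_energy(irradiance_w_per_m2, pixel_pitch_m, fill_factor, exposure_s):
    """Light energy (joules) collected by one square pixel during exposure:
    irradiance integrated over the photosensitive fraction of the pixel area."""
    photosensitive_area = (pixel_pitch_m ** 2) * fill_factor
    return irradiance_w_per_m2 * photosensitive_area * exposure_s

# A 6-micrometer pixel without a micro-lens (45% fill factor, a made-up figure)
# versus one whose micro-lens funnels light from the whole pixel area (100%).
for fill in (0.45, 1.0):
    print(fill, collected_energy(0.1, 6e-6, fill, 1 / 60))
```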
The voltage produced is a nonlinear function of brightness. An approximate linearization is typically
performed by a transformation called gamma correction,
Vout = Vmax (Vin / Vmax)^(1/γ)
where Vmax is the maximum possible voltage and γ is a constant. Values of gamma vary, but are typically
between 1.5 and 3, so Vout is a concave function of Vin, as shown in Figure 5: low input voltages are spread
out at the expense of high voltages, thereby increasing the dynamic range6 of the darker parts of the output
image.
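The correction is a one-line computation; the sketch below evaluates it for γ = 1.6, the value plotted in Figure 5, and shows the concavity numerically.

```python
import numpy as np

def gamma_correct(v_in, v_max=1.0, gamma=1.6):
    """Gamma correction: Vout = Vmax * (Vin / Vmax) ** (1 / gamma)."""
    v_in = np.clip(np.asarray(v_in, dtype=float), 0.0, v_max)
    return v_max * (v_in / v_max) ** (1.0 / gamma)

# The curve is concave for gamma > 1: low voltages are spread apart
# (0.1 maps to about 0.24) while high voltages are compressed.
print(gamma_correct([0.0, 0.1, 0.25, 0.5, 1.0]))
```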
Noise affects all stages of the conversion of brightness values to numbers. First, a small current flows
through each photodetector even if no photons hit its junction. This source of imaging noise is called the
dark current of the sensor. Typically, the dark current cannot be canceled away exactly, because it fluctuates
somewhat and is therefore not entirely predictable. In addition, thermal noise, caused by the agitation of
molecules in the various electronic devices and conductors, is added at all stages of the conversion, with or
without light illuminating the sensor. This type of noise is well modeled by a Gaussian distribution. A third
type of noise is the shot noise that is visible when the levels of exposure are extremely low (but nonzero).
In this situation, each pixel is typically hit by a very small number of photons within the exposure interval.
The fluctuations in the number of photons are then best described by a Poisson distribution.
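These three noise sources can be combined into a crude simulation of a single pixel reading: a Poisson-distributed photoelectron count (shot noise), a fluctuating dark-current contribution, and additive Gaussian thermal noise. All numerical values below are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_pixel_value(mean_photons, quantum_efficiency=0.7,
                      mean_dark_electrons=20.0, read_noise_std=5.0):
    """One pixel reading (in electrons) with the three noise sources above.

    - shot noise: the photoelectron count is Poisson distributed
    - dark current: electrons accumulate even in darkness, and fluctuate
    - thermal noise: additive and well modeled as zero-mean Gaussian
    All numerical defaults are illustrative placeholders.
    """
    photo_electrons = rng.poisson(quantum_efficiency * mean_photons)
    dark_electrons = rng.poisson(mean_dark_electrons)
    thermal = rng.normal(0.0, read_noise_std)
    return photo_electrons + dark_electrons + thermal

# At very low exposure levels the photon counts are tiny, so repeated readings
# of the same dim scene fluctuate by a large fraction of their mean.
print([round(float(noisy_pixel_value(mean_photons=10)), 1) for _ in range(5)])
```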
6Dynamic range: in this context, this is the range of voltages available to express a given range of brightnesses.