Object Detection KPIs: Why slanted edge is not good enough.
- Timofey Uvarov
- Jan 8
- 8 min read
Updated: Apr 20
Tech.AD presentation
The presentation covered 4 main topics:
1. Key image quality indexes measured for an automotive camera system
2. Perceptual analysis of 2mpx and 8mpx sensors to determine detectability of objects at different distances.
3. Infrastructure to support an 8mpx camera with the ability to work with RAW images.
4. Introduction to the ISP pipeline we are working on with our partners.
After the presentation, I was asked several questions on image quality metrics, so I wrote this article to follow up.
Object detection capability primarily depends on detail reproduction and dynamic range reproduction. In this article I’d like to do a deep dive into detail reproduction; a separate article will later be devoted to dynamic range and how to understand it.
Detail Reproduction
Detail reproduction is the ability of the camera system to reproduce and render the finest small details so that the “detector” can reliably identify and classify them. The detector could be you, an AI algorithm, or a target group, such as QA and labeling engineers, following a defined process to form a consolidated opinion.
Detail reproduction is rooted in information theory and measures how effectively a camera uses its optics, photodetectors, and the rest of the ISP pipeline. In other words, detail reproduction tells us what the smallest detail (object) is that we can capture and render properly at a given distance.
This ability depends on every aspect of the image acquisition process, such as lighting conditions, camera lens properties, the sensor’s color filter array and microlenses, pixel architecture, and all ISP blocks, such as bad pixel replacement, demosaicing, sharpening, noise reduction, gamma correction, tone-mapping, contrast enhancement, and others.
To illustrate how important detail reproduction is, we can imagine 2-megapixel and 8-megapixel sensors of the same physical size so that each pixel of the 2mpx sensor is replaced with a 2x2 block of pixels on the 8mpx sensor.
The 8mpx sensor captures 4x more pixels during each readout, but how much more information and how much more useful information does it really carry? For example, if the lens on 8mpx is slightly out of focus or if the cross-talk between two neighboring pixels on the 8mpx sensor is high, significant blurring can occur.

The top convolutional layers of a neural network consume a lot of resources, so it's in our best interest to design our camera system so that it effectively uses its pixel elements.
ISO Charts used in conventional imaging

In conventional imaging, detail reproduction is measured using charts such as the ISO 12233 chart.
The so-called wedges represent thin lines that get closer to each other as they get thinner. The further we can look into the wedge and still distinguish separate lines, the higher the camera system's detail reproduction is.

In the illustration above, the same RAW image was processed using two different demosaicing algorithms. In the top result, we can distinguish clear lines to the mark of about 16.5, while in the bottom image the mark is at about 18. The bottom result therefore has greater detail reproduction than the top one.
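To make the wedge reading less subjective, the reading can also be scored programmatically. Below is a rough sketch (not the procedure used for the figures above) that, for a wedge crop in which the line pitch shrinks from top to bottom, computes the Michelson contrast of each row and reports the last row where the lines are still resolved; the threshold, crop orientation, and the synthetic example are assumptions made purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def wedge_score(wedge_crop, threshold=0.2):
    """Return the last row index of a wedge crop where the lines are still resolved.

    wedge_crop: 2-D luminance array, line pitch shrinking from top to bottom.
    threshold:  minimum Michelson contrast to call a row "resolved".
    """
    last_resolved = 0
    for row_idx, row in enumerate(wedge_crop):
        lo, hi = float(row.min()), float(row.max())
        contrast = (hi - lo) / (hi + lo + 1e-12)  # Michelson contrast of the row
        if contrast >= threshold:
            last_resolved = row_idx
    return last_resolved

# Synthetic example: square-wave lines whose period shrinks from 16 px to 2 px,
# blurred to imitate lens/sensor softening.
x = np.arange(256)
wedge = np.array([0.5 + 0.5 * np.sign(np.sin(2 * np.pi * x / p))
                  for p in np.linspace(16, 2, 200)])
wedge = gaussian_filter1d(wedge, sigma=1.5, axis=1)
print("last resolved row:", wedge_score(wedge))
```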
ISP demosaicing process
However, with modern cameras used for object detection, ISO 12233 and its wedges are not the best way to determine detail reproduction. We must look into the image sensor's pixel architecture and the demosaicing process in the ISP to understand why wedges are an unreliable metric for object detection.

In modern RGB cameras, the image sensor has a mosaiced pattern, such as a Bayer pixel array or similar, so, at each physical pixel location, only one color component is detected. To find the other two color components, a color interpolation or demosaicing process in ISP is used, where the missing color components are reconstructed using neighboring pixel values.
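As a small illustrative sketch of what the mosaic looks like numerically, here is how an RGGB Bayer readout can be simulated from a full-RGB image with NumPy; the RGGB layout is assumed for illustration and does not correspond to any particular sensor discussed in this article.

```python
import numpy as np

def bayer_mosaic(rgb):
    """Simulate an RGGB Bayer readout: keep one color component per pixel.

    rgb: H x W x 3 array. Returns an H x W single-plane mosaic.
    """
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R on even rows, even cols
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G on even rows, odd cols
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G on odd rows, even cols
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B on odd rows, odd cols
    return mosaic
```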
Demosaicing is responsible for reconstructing the tiniest details in the image and is usually optimized by the ISP vendor to show the highest score on wedges such as those described above. However, since the wedges in the chart are just thin lines that are vertical or horizontal, the wedge scores do not always translate into real-life scenarios with objects like cars and pedestrians, which have all types and angles of edges, corners, and patterns.
During its evolution, the ISP demosaicing process became sufficiently sophisticated to detect such patterns as wedges with very high certainty using the method of edge-directed interpolation, where the missing color values are interpolated either horizontally or vertically.
For simplicity of explanation, let's look into the demosaicing and the reconstruction of the green color channel below:

In the image above, we can see how the white squares in the left image, where the green value was initially unknown, are populated with green values in the right image after color interpolation.
Let’s illustrate that process by looking at any one of the white squares:

In a linear camera system, we would calculate the missing value as Gx = (G1+G2+G3+G4)/4
In the edge-directed interpolation mentioned above, the Gx value is determined based on the relation between changes in horizontal and vertical direction. First, horizontal and vertical classifiers are calculated:
GradH=abs(G4-G2), GradV=abs(G1-G3)
Then the horizontal and vertical interpolation values are calculated: Gh=(G4+G2)/2, Gv=(G1+G3)/2
The final value is interpolated as Gx=Gv*Wv+Gh*Wh, where Wh and Wv are the horizontal and vertical weights calculated as functions of GradH and GradV. Thus, a larger weight is assigned to the direction in which less change occurs.
The example below is a vertical dark line on a white background.

Gh=(200+220)/2=210, Gv=(10+10)/2=10, GradH=20, GradV=0.
Since GradV=0 and GradH>0, our weight function forces Wh=0 and Wv=1, so Gx=Gv*1+Gh*0, or Gx=Gv=(G1+G3)/2=(10+10)/2=10, which restores the thin vertical line properly.
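To connect the formulas with the numbers above, here is a minimal sketch of this simplified edge-directed green interpolation; the hard switch between directions is a simplification of what production ISPs actually do, and the neighbor values are the ones from the example.

```python
def interpolate_green(g_up, g_right, g_down, g_left):
    """Edge-directed interpolation of a missing green sample (simplified).

    g_up..g_left are the four green neighbors (G1..G4 in the text).
    """
    grad_h = abs(g_left - g_right)   # GradH = |G4 - G2|
    grad_v = abs(g_up - g_down)      # GradV = |G1 - G3|
    gh = (g_left + g_right) / 2.0    # horizontal estimate
    gv = (g_up + g_down) / 2.0       # vertical estimate
    # Simplified weight function: interpolate only along the smoother direction.
    if grad_v < grad_h:
        wv, wh = 1.0, 0.0
    elif grad_h < grad_v:
        wv, wh = 0.0, 1.0
    else:
        wv = wh = 0.5
    return gv * wv + gh * wh

# Thin dark vertical line on a white background (values from the example):
# G1 = 10 (up), G2 = 200 (right), G3 = 10 (down), G4 = 220 (left)
print(interpolate_green(10, 200, 10, 220))  # -> 10.0, the line is preserved
print((10 + 200 + 10 + 220) / 4.0)          # -> 110.0, linear interpolation washes it out
```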
If linear interpolation is used, or the direction is detected incorrectly due to noise or aliasing, a solid real-world edge or line will be reproduced after demosaicing as a zig-zagged, saw-like edge, as depicted below.

Over the past 20 years, ISP engineers have been challenged with detail reconstruction and have learned how to score well on the wedges and solid edges used for MTF, such as those in the ISO 12233 chart, using advanced versions of edge-directed interpolation and methods based on multi-scale self-similarity (non-local means, BM3D, and the like), which are too complex to describe in this article.
It is important to understand that different demosaicing algorithms produce drastically different results from the point of view of detail reproduction, and a low-quality demosaicing algorithm can often be the bottleneck in a computer vision pipeline, requiring oversized networks to compensate for the artifacts of incorrect pattern reconstruction.
Human Vision charts
As modern ISPs have learned to outsmart charts like ISO 12233, we suggest that human vision charts are the most efficient way to measure the detail reproduction of a camera. Such charts consist of numbered lines of text, each line smaller than the previous one. We place the camera 10ft or 20ft away from the chart, capture the image, and determine which line can be reliably read in the captured image. Vision chart characters have a much more complex skeleton (structure) than horizontal, vertical, or slanted edges, so none of the existing ISP processors has the intelligence to reconstruct them based solely on edge continuity or a self-similarity assumption. An AI algorithm could potentially memorize the whole chart and render it from memory, but no existing ISP has such an ability.

After capturing the vision chart from 10ft (20ft) away, we determine the smallest line we can read without confusion and look for the score to the right (left) of it.
This is a fascinating presentation by Brian Wandell from Stanford University on how humans learn to read:
Sharpness, contrast, and MTF are not equal to detail reproduction!
It is important to mention here that metrics such as contrast, sharpness, or MTF do not directly translate into detail reproduction.

In the example above, Image A has higher detail reproduction than Image B and Image C, but both Image B and Image C are sharper (have higher local contrast) than Image A, and Image C has higher global contrast than Images A and B.
Detail reproduction does indeed depend on the MTF of the lens and sensor combination. So, when we explore detail reproduction, we place the chart in different ROIs and also examine the MTF map for minimums and consistency. MTF is a good measurement for RAW images.
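For readers who want to check MTF on RAW crops themselves, below is a heavily simplified sketch of estimating an MTF curve from a near-vertical edge ROI; a full ISO 12233 e-SFR implementation additionally uses the edge slant to build a 4x-oversampled edge profile, which is omitted here for brevity.

```python
import numpy as np

def edge_mtf(roi):
    """Estimate an MTF curve from a near-vertical edge ROI (simplified e-SFR).

    roi: 2-D linear-light array whose rows cross the edge from left to right.
    Returns (frequencies in cycles/pixel, normalized MTF values).
    """
    esf = roi.mean(axis=0)                    # edge spread function
    lsf = np.diff(esf)                        # line spread function
    lsf = lsf * np.hanning(lsf.size)          # window to tame truncation ripple
    mtf = np.abs(np.fft.rfft(lsf))
    mtf = mtf / mtf[0]                        # normalize so MTF(0) = 1
    freqs = np.fft.rfftfreq(lsf.size, d=1.0)  # cycles per pixel
    return freqs, mtf
```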
Detail reproduction of automotive cameras
Below is the vision chart captured with a 2mpx sensor using lenses with 30, 60, and 120 degrees of horizontal FOV. The green underline is our vision score and marks the bottom-most line we can read without confusion, the yellow line is where we have some confusion, and the red line is where we cannot read anything at all.

In the example below, we kept the lens the same and replaced the sensor from 2mpx to 8mpx, and compared the vision scores:

As we can observe, the 8mpx camera has a vision score of 1.25 vs. 0.67 for the 2mpx camera. The difference in detail reproduction corresponds to a difference of 3 lines on the chart:

Later, we learned how vision scores extrapolate to object detection capability when we ran a pedestrian detection study, placing those cameras side by side on a runway and using a human mannequin as the target.

After collecting consolidated human confidence from a target group of our labeling team, we formed the distance/vision trend for each camera and found that at a range of 300–500 meters the 8mpx camera image produces, on average, 2x the consolidated human confidence of the 2mpx camera, which is very close to the ratio of our vision scores of 1.25 and 0.67.
The visualization of the vision trend displayed below also shows that with an 8mpx camera we can consistently reach the same solid confidence (higher than 0.5) with around 70m of range advantage over a 2mpx camera.

iPhone 12 detail reproduction
As the iPhone camera is available to everyone for reference, we also provide vision scores for each of the three cameras of the iPhone 12 Pro, captured from 10ft:

Let's see how the vision scores of each iPhone camera translate into the representation of a monumental object located near the Pony.AI parking lot, 225m away from the camera.

Since a car wheel spoke can be considered a thin line, I loaded the image into the ImageJ software and displayed a 1-D horizontal cross-section where the height represents the luminance value at each horizontal location:

Looking at the pixel-wise horizontal cross-section of the same feature through each lens, we can see that in the 2.0x image the spoke occupies 3 pixels, so a convolutional kernel at least 5 pixels wide would be required to detect it, while in the 0.5x image a kernel at least 7 pixels wide would be required.
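The same kind of cross-section can be reproduced outside of ImageJ. Below is a minimal sketch with NumPy, Pillow, and Matplotlib; the file name and the row index are placeholders to be replaced with the actual crop and the row that crosses the spoke.

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Placeholder path and row: substitute the crop of the wheel spoke and the row
# that crosses the feature of interest.
img = np.asarray(Image.open("spoke_crop.png").convert("L"), dtype=float)
row = img.shape[0] // 2

plt.plot(img[row, :], drawstyle="steps-mid")  # luminance vs. horizontal position
plt.xlabel("horizontal position, px")
plt.ylabel("luminance")
plt.title("1-D horizontal cross-section")
plt.show()
```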
Let’s look at another of the finest details and build another cross-section:


As we can observe from the cross-section above, the excessive sharpening applied to the 2.0x image results in a double-spiked peak, while the same detail is represented by a smooth, Gaussian-like peak in the 0.5x version; the two representations would require totally different convolutional kernels to detect. Ideally, we want to minimize such variation in the representation of the same details to reduce the number of heavy convolutional layers at the top of our network. For that, we apply sharpening only to frequencies larger than the size of the convolutional kernel of the first layer in the network. How to perform frequency-based spectral decomposition of the signal will be described in another publication.
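The exact spectral decomposition we use is left for that future publication. Purely as an illustration of the idea of sharpening only structures coarser than the first-layer kernel, below is a sketch of band-limited unsharp masking; the Gaussian scales and the gain are illustrative assumptions, not our production settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def coarse_only_sharpen(img, sigma_fine=3.0, sigma_coarse=6.0, gain=0.5):
    """Boost only structures coarser than the first-layer kernel (illustrative).

    Detail finer than sigma_fine (chosen larger than the first conv kernel)
    is left untouched; only the band between the two Gaussian scales is amplified.
    """
    fine = gaussian_filter(img, sigma_fine)    # removes sub-kernel-size detail
    coarse = gaussian_filter(img, sigma_coarse)
    band = fine - coarse                       # mid-frequency band only
    return img + gain * band
```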
Finally, I’d like to show how much data is used to represent a small piece of a tree branch, which occupies less than 1% of a single image.
