7.6 Computer vision algorithms

Written by the Fiveable Content Team • Last updated September 2025

Computer vision algorithms are the eyes of autonomous vehicles, enabling them to perceive and interpret their surroundings. These algorithms process visual data, detect objects, estimate depth, and reconstruct 3D scenes, forming the foundation of a vehicle's perception system.

From image processing to object detection, semantic segmentation to visual SLAM, these techniques work together to create a comprehensive understanding of the environment. This knowledge is crucial for navigation, obstacle avoidance, and decision-making in self-driving cars.

Fundamentals of computer vision

  • Computer vision algorithms form the foundation of perception systems in autonomous vehicles, enabling them to interpret and understand their surroundings
  • These fundamental concepts provide the building blocks for more advanced techniques used in object detection, localization, and navigation in self-driving cars

Image representation and processing

  • Digital images are represented as 2D arrays of pixel values, typically in RGB color space
  • Image processing techniques include filtering, histogram equalization, and edge detection
  • Grayscale conversion simplifies processing by reducing color information to intensity values
  • Convolution operations apply kernels to images for various effects (blurring, sharpening)
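
A minimal Python sketch of these basics using OpenCV and NumPy; the file name frame.png, the sharpening kernel, and the Canny thresholds are illustrative assumptions:

```python
import cv2
import numpy as np

img = cv2.imread("frame.png")                     # BGR image as an H x W x 3 uint8 array
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # reduce color to intensity values

# 3x3 sharpening kernel applied with filter2D (correlation, equivalent to
# convolution here because the kernel is symmetric)
kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]], dtype=np.float32)
sharpened = cv2.filter2D(gray, ddepth=-1, kernel=kernel)

# Two other basic processing steps mentioned above
equalized = cv2.equalizeHist(gray)                # histogram equalization
edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # edge detection
```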

Feature detection and extraction

  • Identifies distinctive elements in images, crucial for object recognition and tracking
  • Corner detection algorithms (Harris, FAST) locate points where intensity changes sharply in multiple directions
  • Scale-Invariant Feature Transform (SIFT) extracts features that are invariant to rotation and scale
  • Speeded Up Robust Features (SURF) provides a faster alternative to SIFT
  • Feature descriptors encode local image information for matching and recognition tasks
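
A short sketch of feature detection and descriptor matching, using ORB (a fast binary detector and descriptor included in stock OpenCV) as a stand-in for SIFT/SURF; the image paths are placeholders:

```python
import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)              # detector + binary descriptor
kp1, des1 = orb.detectAndCompute(img1, None)      # keypoints and descriptors per image
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (appropriate for binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches")
```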

Image segmentation techniques

  • Divides images into meaningful regions or objects for further analysis
  • Thresholding separates foreground from background based on pixel intensities
  • Region growing groups similar pixels starting from seed points
  • Watershed algorithm treats image as a topographic surface for segmentation
  • Graph-based methods (Graph Cuts) optimize segmentation using graph theory
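
A sketch of simple intensity-based segmentation, assuming a grayscale input image road.png: Otsu's method picks the threshold that best separates the intensity histogram, and connected-component labeling then separates the resulting foreground regions:

```python
import cv2

gray = cv2.imread("road.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method selects the threshold automatically from the histogram
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Label each connected foreground region for further analysis
num_labels, labels = cv2.connectedComponents(mask)
print(f"Found {num_labels - 1} foreground regions")
```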

Object detection and recognition

  • Enables autonomous vehicles to identify and locate objects in their environment, critical for navigation and collision avoidance
  • Combines techniques from image processing, machine learning, and deep learning to achieve robust performance in various conditions

Convolutional neural networks

  • Deep learning architecture designed for processing grid-like data, including images
  • Convolutional layers apply learnable filters to extract hierarchical features
  • Pooling layers reduce spatial dimensions and provide translation invariance
  • Fully connected layers perform high-level reasoning for classification tasks
  • Transfer learning allows pre-trained CNNs to be fine-tuned for specific object detection tasks
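
A sketch of transfer learning with a pretrained torchvision classifier (weight-enum API of torchvision ≥ 0.13 assumed); the 10-class task, the frozen backbone, and the random stand-in batch are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                  # freeze convolutional feature extractor
model.fc = nn.Linear(model.fc.in_features, 10)   # new head for a hypothetical 10-class task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)             # stand-in batch of RGB images
labels = torch.randint(0, 10, (4,))
loss = criterion(model(images), labels)          # one fine-tuning step
loss.backward()
optimizer.step()
```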

Region-based methods

  • R-CNN (Regions with CNN features) proposes regions of interest for object detection
  • Fast R-CNN improves efficiency by sharing computation across proposed regions
  • Faster R-CNN introduces Region Proposal Network (RPN) for end-to-end training
  • Mask R-CNN extends Faster R-CNN to perform instance segmentation
  • Region-based methods excel in accuracy but may have slower inference times
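
A sketch of Faster R-CNN inference with torchvision's pretrained COCO model; the random input tensor and the 0.5 score threshold are placeholders for a real camera frame and a tuned threshold:

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)                  # RGB tensor scaled to [0, 1]
with torch.no_grad():
    output = model([image])[0]                   # dict of boxes, labels, scores

keep = output["scores"] > 0.5                    # drop low-confidence detections
print(output["boxes"][keep], output["labels"][keep])
```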

Single-shot detectors

  • Perform object detection in a single forward pass of the network
  • YOLO (You Only Look Once) divides image into grid cells for simultaneous predictions
  • SSD (Single Shot Detector) uses multiple feature maps for detection at different scales
  • RetinaNet addresses class imbalance with focal loss for improved performance
  • Single-shot detectors prioritize speed, making them suitable for real-time applications

Semantic segmentation

  • Assigns a class label to each pixel in an image, crucial for understanding scene layout
  • Enables autonomous vehicles to differentiate between road, sidewalk, vehicles, and pedestrians

Fully convolutional networks

  • Adapts classification networks for dense pixel-wise prediction
  • Replaces fully connected layers with convolutional layers for spatial information preservation
  • Upsampling techniques (transposed convolutions, unpooling) restore spatial resolution
  • Skip connections combine low-level and high-level features for improved segmentation
  • FCN architecture serves as the foundation for many modern semantic segmentation approaches
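
A sketch of dense pixel-wise prediction with torchvision's pretrained FCN-ResNet50 (weight-enum API assumed); in practice the input would be a normalized camera frame rather than the random tensor used here:

```python
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

model = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 512, 1024)              # stand-in normalized RGB batch
with torch.no_grad():
    logits = model(image)["out"]                 # shape (1, num_classes, H, W)
class_map = logits.argmax(dim=1)                 # per-pixel class labels, shape (1, H, W)
```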

Encoder-decoder architectures

  • Consists of an encoder network for feature extraction and a decoder for upsampling
  • U-Net introduces skip connections between encoder and decoder for fine-grained segmentation
  • SegNet uses pooling indices for efficient upsampling in the decoder
  • DeepLab employs atrous convolutions for multi-scale context aggregation
  • Pyramid Scene Parsing Network (PSPNet) captures global context through pyramid pooling

Instance vs semantic segmentation

  • Semantic segmentation assigns class labels to pixels without distinguishing individual objects
  • Instance segmentation identifies and separates individual object instances of the same class
  • Mask R-CNN performs instance segmentation by adding a mask prediction branch to Faster R-CNN
  • Panoptic segmentation combines semantic and instance segmentation for complete scene understanding
  • Instance segmentation proves valuable for tracking multiple objects in autonomous driving scenarios

Depth estimation

  • Determines the distance of objects from the camera, essential for 3D scene understanding
  • Enables autonomous vehicles to gauge distances for path planning and obstacle avoidance

Stereo vision algorithms

  • Utilizes two cameras to estimate depth through triangulation
  • Stereo matching finds corresponding points between left and right images
  • Disparity computation measures the pixel offset between corresponding points
  • Semi-Global Matching (SGM) algorithm balances local and global matching costs
  • Stereo vision provides accurate depth estimates but requires careful camera calibration
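
A sketch of disparity and depth computation on a rectified stereo pair with OpenCV's semi-global block matcher; the focal length and baseline are example calibration values:

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point to pixels

focal_px, baseline_m = 700.0, 0.54               # assumed calibration values
with np.errstate(divide="ignore"):
    depth_m = focal_px * baseline_m / disparity  # Z = f * B / d, valid where d > 0
```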

Monocular depth estimation

  • Estimates depth from a single image using machine learning techniques
  • Supervised learning approaches train on ground truth depth data
  • Self-supervised methods leverage geometric constraints for training
  • Encoder-decoder architectures commonly used for dense depth prediction
  • Monocular methods offer flexibility but may struggle with scale ambiguity

Time-of-flight sensors

  • Active sensing technology measures the time for light to travel to objects and back
  • Emits modulated infrared light and measures either the round-trip time of pulses or the phase shift of the returned signal, depending on the sensor design
  • Provides dense depth maps with high frame rates
  • Effective in low-light conditions and for short to medium ranges
  • Complements camera-based depth estimation in autonomous vehicle sensor suites

Optical flow

  • Estimates the motion of pixels between consecutive frames in a video sequence
  • Crucial for motion analysis, object tracking, and ego-motion estimation in autonomous vehicles

Lucas-Kanade method

  • Assumes constant flow in a local neighborhood around each pixel
  • Solves optical flow equations using least squares estimation
  • Suitable for sparse optical flow computation on corner points or features
  • Pyramidal implementation handles larger displacements between frames
  • Computationally efficient but sensitive to illumination changes
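
A sketch of sparse tracking with the pyramidal Lucas-Kanade implementation in OpenCV; the two frame files and the tracker parameters are illustrative:

```python
import cv2
import numpy as np

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Select corner points to track in the first frame
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=10)

# Pyramidal Lucas-Kanade handles larger displacements via an image pyramid
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)
flow_vectors = (p1 - p0)[status.flatten() == 1]  # motion vectors of successfully tracked corners
```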

Horn-Schunck algorithm

  • Global approach that assumes smoothness of flow field across the entire image
  • Minimizes a global energy function combining data and smoothness terms
  • Iterative solution produces dense optical flow fields
  • Handles smooth variations in flow but struggles with motion discontinuities
  • Provides more comprehensive motion information at the cost of increased computation

Dense vs sparse optical flow

  • Sparse optical flow computes motion for selected points (corners, features)
  • Dense optical flow estimates motion for every pixel in the image
  • Sparse methods (Lucas-Kanade) offer faster computation and robustness to noise
  • Dense methods (Horn-Schunck) provide complete motion fields but are more computationally intensive
  • Hybrid approaches combine sparse and dense techniques for balanced performance
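
For comparison with the sparse example above, a sketch of dense flow with OpenCV's Farneback implementation; the parameter values are common defaults, not tuned:

```python
import cv2

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# One 2D motion vector per pixel
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # per-pixel speed and direction
```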

Visual SLAM

  • Simultaneous Localization and Mapping enables autonomous vehicles to build a map of the environment while estimating their position
  • Crucial for navigation in unknown environments and long-term autonomy

Feature-based vs direct methods

  • Feature-based SLAM extracts and tracks distinctive features across frames
  • Direct methods optimize camera pose using raw pixel intensities
  • ORB-SLAM represents a popular feature-based approach with robust performance
  • LSD-SLAM and DSO are examples of direct methods that work on semi-dense or sparse depth maps
  • Feature-based methods offer robustness, while direct methods can work in low-texture environments

Loop closure detection

  • Identifies when the vehicle revisits a previously mapped area
  • Crucial for correcting accumulated drift and maintaining global consistency
  • Appearance-based methods use visual similarity to detect loop closures
  • Geometric verification ensures the validity of potential loop closures
  • Bag-of-Words models and deep learning techniques improve loop closure detection accuracy

Map optimization techniques

  • Refines the estimated map and camera trajectory to minimize errors
  • Bundle adjustment jointly optimizes camera poses and 3D point positions
  • Pose graph optimization focuses on optimizing camera poses using relative constraints
  • Factor graph formulations represent SLAM problems as probabilistic graphical models
  • Incremental and global optimization strategies balance computational efficiency and accuracy

3D reconstruction

  • Creates 3D models of the environment from 2D images or depth data
  • Enables autonomous vehicles to build detailed representations of their surroundings

Structure from motion

  • Reconstructs 3D scenes from multiple images taken from different viewpoints
  • Feature matching and tracking establish correspondences across images
  • Epipolar geometry constrains the search for matching points
  • Incremental SfM builds the reconstruction by adding images sequentially
  • Global SfM methods optimize all camera poses and 3D points simultaneously
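
A sketch of the two-view step at the core of incremental SfM: estimate the essential matrix from matched points, recover the relative pose, and triangulate. The intrinsic matrix K and the random "matches" are stand-ins for real calibration data and feature matching:

```python
import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])                  # example pinhole intrinsics

# pts1, pts2 would come from feature matching; random stand-ins used here
pts1 = np.random.rand(100, 2) * [640, 480]
pts2 = pts1 + np.random.randn(100, 2)

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)   # rotation and unit translation

# Triangulate 3D points from the two camera projection matrices
P1 = K @ np.eye(3, 4)
P2 = K @ np.hstack([R, t])
points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points_3d = (points_4d[:3] / points_4d[3]).T     # homogeneous to Euclidean coordinates
```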

Multi-view stereo

  • Densifies sparse 3D reconstructions obtained from SfM
  • Patch-based MVS algorithms estimate oriented patches for each 3D point
  • Volumetric methods discretize space into voxels and optimize occupancy
  • Depth map fusion combines multiple depth estimates for dense reconstruction
  • MVS techniques produce detailed 3D models but can be computationally intensive

Point cloud processing

  • Manages and analyzes 3D point data obtained from reconstruction or depth sensors
  • Registration aligns multiple point clouds into a common coordinate system
  • Filtering removes noise and outliers to improve point cloud quality
  • Downsampling reduces point cloud density for efficient processing
  • Surface reconstruction converts point clouds into mesh or parametric surfaces
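
A sketch of voxel-grid downsampling in plain NumPy (libraries such as PCL or Open3D provide equivalents): points falling in the same voxel are replaced by their centroid.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """points: (N, 3) array of XYZ coordinates; returns one centroid per occupied voxel."""
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel and average each group
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    sums = np.zeros((inverse.max() + 1, 3))
    counts = np.zeros(inverse.max() + 1)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]

cloud = np.random.rand(100_000, 3) * 50.0        # stand-in LiDAR-style point cloud (meters)
downsampled = voxel_downsample(cloud, voxel_size=0.5)
```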

Camera calibration

  • Determines the geometric and optical characteristics of cameras used in autonomous vehicles
  • Essential for accurate 3D reconstruction, depth estimation, and multi-camera systems

Intrinsic vs extrinsic parameters

  • Intrinsic parameters describe the camera's internal characteristics (focal length, principal point)
  • Extrinsic parameters define the camera's position and orientation in world coordinates
  • Pinhole camera model represents the basic mathematical framework for calibration
  • Intrinsic calibration uses images of known patterns (checkerboards) to estimate parameters
  • Extrinsic calibration aligns multiple cameras or sensors in a common reference frame
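
A sketch of intrinsic calibration from checkerboard images with OpenCV; the 9x6 board size and the calib/*.png file pattern are assumptions about the capture setup, and the final undistort call previews the correction covered in the next subsection:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                 # inner corners of the checkerboard
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # planar board coordinates

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimates the camera matrix (focal length, principal point) and distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points,
                                                 gray.shape[::-1], None, None)
undistorted = cv2.undistort(gray, K, dist)       # apply the distortion correction
```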

Distortion correction

  • Compensates for lens imperfections that cause image deformations
  • Radial distortion causes straight lines to appear curved (barrel or pincushion effect)
  • Tangential distortion results from misalignment of camera lenses
  • Distortion models (Brown-Conrady) estimate coefficients to correct these effects
  • Undistortion process applies inverse transformations to rectify distorted images

Stereo camera calibration

  • Calibrates a pair of cameras used for stereo vision in autonomous vehicles
  • Determines the relative pose (rotation and translation) between the two cameras
  • Rectification process aligns image planes to simplify stereo matching
  • Epipolar geometry constrains the search for corresponding points to 1D lines
  • Accurate stereo calibration is crucial for precise depth estimation and 3D reconstruction

Image enhancement

  • Improves image quality to facilitate better performance of computer vision algorithms
  • Critical for autonomous vehicles operating in challenging lighting and weather conditions

Contrast adjustment

  • Enhances image visibility by optimizing the distribution of pixel intensities
  • Histogram equalization spreads out the most frequent intensity values
  • Contrast Limited Adaptive Histogram Equalization (CLAHE) applies equalization locally
  • Gamma correction adjusts image brightness and contrast using a power-law function
  • Contrast adjustment improves feature detection and object recognition in low-contrast scenes
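
A sketch of two contrast adjustments in OpenCV, CLAHE and gamma correction via a lookup table; the clip limit, tile size, and gamma value are example settings:

```python
import cv2
import numpy as np

gray = cv2.imread("night_scene.png", cv2.IMREAD_GRAYSCALE)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)                     # locally adaptive histogram equalization

gamma = 0.5                                      # values < 1 brighten dark regions
lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
brightened = cv2.LUT(gray, lut)                  # power-law intensity mapping
```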

Noise reduction

  • Removes unwanted variations in pixel intensities caused by sensor imperfections or environmental factors
  • Gaussian filtering smooths images by convolving with a Gaussian kernel
  • Median filtering effectively removes salt-and-pepper noise while preserving edges
  • Non-local means denoising exploits self-similarity in images for high-quality results
  • Bilateral filtering preserves edges while smoothing by considering both spatial and intensity differences

Super-resolution techniques

  • Increases the resolution and quality of low-resolution images
  • Single image super-resolution uses machine learning to infer high-frequency details
  • Multi-frame super-resolution combines information from multiple low-resolution frames
  • Generative Adversarial Networks (GANs) produce realistic high-resolution images
  • Super-resolution enhances the performance of object detection and recognition tasks

Performance evaluation

  • Assesses the effectiveness and efficiency of computer vision algorithms for autonomous vehicles
  • Guides algorithm selection, optimization, and validation for real-world deployment

Accuracy metrics

  • Intersection over Union (IoU) measures the overlap between predicted and ground truth bounding boxes
  • Mean Average Precision (mAP) evaluates object detection performance across multiple classes
  • Pixel accuracy and mean Intersection over Union (mIoU) assess semantic segmentation quality
  • F1 score balances precision and recall for binary classification tasks
  • Confusion matrices provide detailed breakdowns of classification performance
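
A sketch of the IoU computation for two axis-aligned boxes in [x1, y1, x2, y2] format:

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Overlap between two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = np.maximum(box_a[:2], box_b[:2])    # top-left of the intersection
    x2, y2 = np.minimum(box_a[2:], box_b[2:])    # bottom-right of the intersection
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou(np.array([0, 0, 10, 10]), np.array([5, 5, 15, 15])))  # 25 / 175 ≈ 0.143
```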

Speed and efficiency measures

  • Frames per second (FPS) quantifies real-time processing capability
  • Floating-point operations (FLOPs) measure computational complexity
  • Memory usage and model size impact deployment on embedded systems
  • Inference time on specific hardware platforms (CPUs, GPUs, TPUs) guides algorithm selection
  • Energy efficiency becomes crucial for battery-powered autonomous vehicles

Benchmarking datasets

  • KITTI dataset provides real-world data for autonomous driving tasks
  • Cityscapes focuses on semantic understanding of urban street scenes
  • nuScenes offers multi-modal sensor data for 3D object detection and tracking
  • Waymo Open Dataset includes high-quality, diverse autonomous driving data
  • BDD100K (Berkeley DeepDrive) covers diverse driving conditions and scenarios