7.6 Computer vision algorithms

Written by the Fiveable Content Team • Last updated September 2025

Computer vision algorithms are the eyes of autonomous vehicles, enabling them to perceive and interpret their surroundings. These algorithms process visual data, detect objects, estimate depth, and reconstruct 3D scenes, forming the foundation of a vehicle's perception system.

From image processing to object detection, semantic segmentation to visual SLAM, these techniques work together to create a comprehensive understanding of the environment. This knowledge is crucial for navigation, obstacle avoidance, and decision-making in self-driving cars.

Fundamentals of computer vision

  • Computer vision algorithms form the foundation of perception systems in autonomous vehicles, enabling them to interpret and understand their surroundings
  • These fundamental concepts provide the building blocks for more advanced techniques used in object detection, localization, and navigation in self-driving cars

Image representation and processing

  • Digital images are represented as 2D arrays of pixel values, typically in RGB color space
  • Image processing techniques include filtering, histogram equalization, and edge detection
  • Grayscale conversion simplifies processing by reducing color information to intensity values
  • Convolution operations apply kernels to images for various effects (blurring, sharpening)
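
A minimal Python sketch of these basics using OpenCV and NumPy; the file name frame.png, the sharpening kernel, and the Canny thresholds are illustrative assumptions:

```python
import cv2
import numpy as np

img = cv2.imread("frame.png")                     # BGR image as an H x W x 3 uint8 array
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # reduce color to intensity values

# 3x3 sharpening kernel applied with filter2D (correlation, equivalent to
# convolution here because the kernel is symmetric)
kernel = np.array([[0, -1, 0],
                   [-1, 5, -1],
                   [0, -1, 0]], dtype=np.float32)
sharpened = cv2.filter2D(gray, ddepth=-1, kernel=kernel)

# Two other basic processing steps mentioned above
equalized = cv2.equalizeHist(gray)                # histogram equalization
edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # edge detection
```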

Feature detection and extraction

  • Identifies distinctive elements in images, crucial for object recognition and tracking
  • Corner detection algorithms (Harris, FAST) locate points where intensity changes sharply in multiple directions
  • Scale-Invariant Feature Transform (SIFT) extracts features that are invariant to rotation and scale
  • Speeded Up Robust Features (SURF) provides a faster alternative to SIFT
  • Feature descriptors encode local image information for matching and recognition tasks
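
A short sketch of feature detection and descriptor matching, using ORB (a fast binary detector and descriptor included in stock OpenCV) as a stand-in for SIFT/SURF; the image paths are placeholders:

```python
import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)              # detector + binary descriptor
kp1, des1 = orb.detectAndCompute(img1, None)      # keypoints and descriptors per image
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (appropriate for binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches")
```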

Image segmentation techniques

  • Divides images into meaningful regions or objects for further analysis
  • Thresholding separates foreground from background based on pixel intensities
  • Region growing groups similar pixels starting from seed points
  • Watershed algorithm treats image as a topographic surface for segmentation
  • Graph-based methods (Graph Cuts) optimize segmentation using graph theory
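
A sketch of simple intensity-based segmentation, assuming a grayscale input image road.png: Otsu's method picks the threshold that best separates the intensity histogram, and connected-component labeling then separates the resulting foreground regions:

```python
import cv2

gray = cv2.imread("road.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method selects the threshold automatically from the histogram
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Label each connected foreground region for further analysis
num_labels, labels = cv2.connectedComponents(mask)
print(f"Found {num_labels - 1} foreground regions")
```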

Object detection and recognition

  • Enables autonomous vehicles to identify and locate objects in their environment, critical for navigation and collision avoidance
  • Combines techniques from image processing, machine learning, and deep learning to achieve robust performance in various conditions

Convolutional neural networks

  • Deep learning architecture designed for processing grid-like data, including images
  • Convolutional layers apply learnable filters to extract hierarchical features
  • Pooling layers reduce spatial dimensions and provide translation invariance
  • Fully connected layers perform high-level reasoning for classification tasks
  • Transfer learning allows pre-trained CNNs to be fine-tuned for specific object detection tasks
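
A sketch of transfer learning with a pretrained torchvision classifier (weight-enum API of torchvision ≥ 0.13 assumed); the 10-class task, the frozen backbone, and the random stand-in batch are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                  # freeze convolutional feature extractor
model.fc = nn.Linear(model.fc.in_features, 10)   # new head for a hypothetical 10-class task

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)             # stand-in batch of RGB images
labels = torch.randint(0, 10, (4,))
loss = criterion(model(images), labels)          # one fine-tuning step
loss.backward()
optimizer.step()
```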

Region-based methods

  • R-CNN (Regions with CNN features) proposes regions of interest for object detection
  • Fast R-CNN improves efficiency by sharing computation across proposed regions
  • Faster R-CNN introduces Region Proposal Network (RPN) for end-to-end training
  • Mask R-CNN extends Faster R-CNN to perform instance segmentation
  • Region-based methods excel in accuracy but may have slower inference times
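
A sketch of Faster R-CNN inference with torchvision's pretrained COCO model; the random input tensor and the 0.5 score threshold are placeholders for a real camera frame and a tuned threshold:

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)                  # RGB tensor scaled to [0, 1]
with torch.no_grad():
    output = model([image])[0]                   # dict of boxes, labels, scores

keep = output["scores"] > 0.5                    # drop low-confidence detections
print(output["boxes"][keep], output["labels"][keep])
```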

Single-shot detectors

  • Perform object detection in a single forward pass of the network
  • YOLO (You Only Look Once) divides image into grid cells for simultaneous predictions
  • SSD (Single Shot Detector) uses multiple feature maps for detection at different scales
  • RetinaNet addresses class imbalance with focal loss for improved performance
  • Single-shot detectors prioritize speed, making them suitable for real-time applications

Semantic segmentation

  • Assigns a class label to each pixel in an image, crucial for understanding scene layout
  • Enables autonomous vehicles to differentiate between road, sidewalk, vehicles, and pedestrians

Fully convolutional networks

  • Adapts classification networks for dense pixel-wise prediction
  • Replaces fully connected layers with convolutional layers for spatial information preservation
  • Upsampling techniques (transposed convolutions, unpooling) restore spatial resolution
  • Skip connections combine low-level and high-level features for improved segmentation
  • FCN architecture serves as the foundation for many modern semantic segmentation approaches
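
A sketch of dense pixel-wise prediction with torchvision's pretrained FCN-ResNet50 (weight-enum API assumed); in practice the input would be a normalized camera frame rather than the random tensor used here:

```python
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

model = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 512, 1024)              # stand-in normalized RGB batch
with torch.no_grad():
    logits = model(image)["out"]                 # shape (1, num_classes, H, W)
class_map = logits.argmax(dim=1)                 # per-pixel class labels, shape (1, H, W)
```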

Encoder-decoder architectures

  • Consists of an encoder network for feature extraction and a decoder for upsampling
  • U-Net introduces skip connections between encoder and decoder for fine-grained segmentation
  • SegNet uses pooling indices for efficient upsampling in the decoder
  • DeepLab employs atrous convolutions for multi-scale context aggregation
  • Pyramid Scene Parsing Network (PSPNet) captures global context through pyramid pooling

Instance vs semantic segmentation

  • Semantic segmentation assigns class labels to pixels without distinguishing individual objects
  • Instance segmentation identifies and separates individual object instances of the same class
  • Mask R-CNN performs instance segmentation by adding a mask prediction branch to Faster R-CNN
  • Panoptic segmentation combines semantic and instance segmentation for complete scene understanding
  • Instance segmentation proves valuable for tracking multiple objects in autonomous driving scenarios

Depth estimation

  • Determines the distance of objects from the camera, essential for 3D scene understanding
  • Enables autonomous vehicles to gauge distances for path planning and obstacle avoidance

Stereo vision algorithms

  • Utilizes two cameras to estimate depth through triangulation
  • Stereo matching finds corresponding points between left and right images
  • Disparity computation measures the pixel offset between corresponding points
  • Semi-Global Matching (SGM) algorithm balances local and global matching costs
  • Stereo vision provides accurate depth estimates but requires careful camera calibration
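
A sketch of disparity and depth computation on a rectified stereo pair with OpenCV's semi-global block matcher; the focal length and baseline are example calibration values:

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point to pixels

focal_px, baseline_m = 700.0, 0.54               # assumed calibration values
with np.errstate(divide="ignore"):
    depth_m = focal_px * baseline_m / disparity  # Z = f * B / d, valid where d > 0
```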

Monocular depth estimation

  • Estimates depth from a single image using machine learning techniques
  • Supervised learning approaches train on ground truth depth data
  • Self-supervised methods leverage geometric constraints for training
  • Encoder-decoder architectures commonly used for dense depth prediction
  • Monocular methods offer flexibility but may struggle with scale ambiguity

Time-of-flight sensors

  • Active sensing technology measures the time for light to travel to objects and back
  • Emits modulated infrared light and measures either the round-trip time of pulses or the phase shift of the returned signal, depending on the sensor design
  • Provides dense depth maps with high frame rates
  • Effective in low-light conditions and for short to medium ranges
  • Complements camera-based depth estimation in autonomous vehicle sensor suites

Optical flow

  • Estimates the motion of pixels between consecutive frames in a video sequence
  • Crucial for motion analysis, object tracking, and ego-motion estimation in autonomous vehicles

Lucas-Kanade method

  • Assumes constant flow in a local neighborhood around each pixel
  • Solves optical flow equations using least squares estimation
  • Suitable for sparse optical flow computation on corner points or features
  • Pyramidal implementation handles larger displacements between frames
  • Computationally efficient but sensitive to illumination changes
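
A sketch of sparse tracking with the pyramidal Lucas-Kanade implementation in OpenCV; the two frame files and the tracker parameters are illustrative:

```python
import cv2
import numpy as np

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Select corner points to track in the first frame
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=10)

# Pyramidal Lucas-Kanade handles larger displacements via an image pyramid
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)
flow_vectors = (p1 - p0)[status.flatten() == 1]  # motion vectors of successfully tracked corners
```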

Horn-Schunck algorithm

  • Global approach that assumes smoothness of flow field across the entire image
  • Minimizes a global energy function combining data and smoothness terms
  • Iterative solution produces dense optical flow fields
  • Handles smooth variations in flow but struggles with motion discontinuities
  • Provides more comprehensive motion information at the cost of increased computation

Dense vs sparse optical flow

  • Sparse optical flow computes motion for selected points (corners, features)
  • Dense optical flow estimates motion for every pixel in the image
  • Sparse methods (Lucas-Kanade) offer faster computation and robustness to noise
  • Dense methods (Horn-Schunck) provide complete motion fields but are more computationally intensive
  • Hybrid approaches combine sparse and dense techniques for balanced performance
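
For comparison with the sparse example above, a sketch of dense flow with OpenCV's Farneback implementation; the parameter values are common defaults, not tuned:

```python
import cv2

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# One 2D motion vector per pixel
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # per-pixel speed and direction
```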

Visual SLAM

  • Simultaneous Localization and Mapping enables autonomous vehicles to build a map of the environment while estimating their position
  • Crucial for navigation in unknown environments and long-term autonomy

Feature-based vs direct methods

  • Feature-based SLAM extracts and tracks distinctive features across frames
  • Direct methods optimize camera pose using raw pixel intensities
  • ORB-SLAM represents a popular feature-based approach with robust performance
  • LSD-SLAM and DSO are examples of direct methods that work on semi-dense or sparse depth maps
  • Feature-based methods offer robustness, while direct methods can work in low-texture environments

Loop closure detection

  • Identifies when the vehicle revisits a previously mapped area
  • Crucial for correcting accumulated drift and maintaining global consistency
  • Appearance-based methods use visual similarity to detect loop closures
  • Geometric verification ensures the validity of potential loop closures
  • Bag-of-Words models and deep learning techniques improve loop closure detection accuracy

Map optimization techniques

  • Refines the estimated map and camera trajectory to minimize errors
  • Bundle adjustment jointly optimizes camera poses and 3D point positions
  • Pose graph optimization focuses on optimizing camera poses using relative constraints
  • Factor graph formulations represent SLAM problems as probabilistic graphical models
  • Incremental and global optimization strategies balance computational efficiency and accuracy

3D reconstruction

  • Creates 3D models of the environment from 2D images or depth data
  • Enables autonomous vehicles to build detailed representations of their surroundings

Structure from motion

  • Reconstructs 3D scenes from multiple images taken from different viewpoints
  • Feature matching and tracking establish correspondences across images
  • Epipolar geometry constrains the search for matching points
  • Incremental SfM builds the reconstruction by adding images sequentially
  • Global SfM methods optimize all camera poses and 3D points simultaneously
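
A sketch of the two-view step at the core of incremental SfM: estimate the essential matrix from matched points, recover the relative pose, and triangulate. The intrinsic matrix K and the random "matches" are stand-ins for real calibration data and feature matching:

```python
import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])                  # example pinhole intrinsics

# pts1, pts2 would come from feature matching; random stand-ins used here
pts1 = np.random.rand(100, 2) * [640, 480]
pts2 = pts1 + np.random.randn(100, 2)

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)   # rotation and unit translation

# Triangulate 3D points from the two camera projection matrices
P1 = K @ np.eye(3, 4)
P2 = K @ np.hstack([R, t])
points_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points_3d = (points_4d[:3] / points_4d[3]).T     # homogeneous to Euclidean coordinates
```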

Multi-view stereo

  • Densifies sparse 3D reconstructions obtained from SfM
  • Patch-based MVS algorithms estimate oriented patches for each 3D point
  • Volumetric methods discretize space into voxels and optimize occupancy
  • Depth map fusion combines multiple depth estimates for dense reconstruction
  • MVS techniques produce detailed 3D models but can be computationally intensive

Point cloud processing

  • Manages and analyzes 3D point data obtained from reconstruction or depth sensors
  • Registration aligns multiple point clouds into a common coordinate system
  • Filtering removes noise and outliers to improve point cloud quality
  • Downsampling reduces point cloud density for efficient processing
  • Surface reconstruction converts point clouds into mesh or parametric surfaces
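
A sketch of voxel-grid downsampling in plain NumPy (libraries such as PCL or Open3D provide equivalents): points falling in the same voxel are replaced by their centroid.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """points: (N, 3) array of XYZ coordinates; returns one centroid per occupied voxel."""
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel and average each group
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    sums = np.zeros((inverse.max() + 1, 3))
    counts = np.zeros(inverse.max() + 1)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]

cloud = np.random.rand(100_000, 3) * 50.0        # stand-in LiDAR-style point cloud (meters)
downsampled = voxel_downsample(cloud, voxel_size=0.5)
```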

Camera calibration

  • Determines the geometric and optical characteristics of cameras used in autonomous vehicles
  • Essential for accurate 3D reconstruction, depth estimation, and multi-camera systems

Intrinsic vs extrinsic parameters

  • Intrinsic parameters describe the camera's internal characteristics (focal length, principal point)
  • Extrinsic parameters define the camera's position and orientation in world coordinates
  • Pinhole camera model represents the basic mathematical framework for calibration
  • Intrinsic calibration uses images of known patterns (checkerboards) to estimate parameters
  • Extrinsic calibration aligns multiple cameras or sensors in a common reference frame
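
A sketch of intrinsic calibration from checkerboard images with OpenCV; the 9x6 board size and the calib/*.png file pattern are assumptions about the capture setup, and the final undistort call previews the correction covered in the next subsection:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                 # inner corners of the checkerboard
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # planar board coordinates

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimates the camera matrix (focal length, principal point) and distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points,
                                                 gray.shape[::-1], None, None)
undistorted = cv2.undistort(gray, K, dist)       # apply the distortion correction
```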

Distortion correction

  • Compensates for lens imperfections that cause image deformations
  • Radial distortion causes straight lines to appear curved (barrel or pincushion effect)
  • Tangential distortion results from misalignment of camera lenses
  • Distortion models (Brown-Conrady) estimate coefficients to correct these effects
  • Undistortion process applies inverse transformations to rectify distorted images

Stereo camera calibration

  • Calibrates a pair of cameras used for stereo vision in autonomous vehicles
  • Determines the relative pose (rotation and translation) between the two cameras
  • Rectification process aligns image planes to simplify stereo matching
  • Epipolar geometry constrains the search for corresponding points to 1D lines
  • Accurate stereo calibration is crucial for precise depth estimation and 3D reconstruction

Image enhancement

  • Improves image quality to facilitate better performance of computer vision algorithms
  • Critical for autonomous vehicles operating in challenging lighting and weather conditions

Contrast adjustment

  • Enhances image visibility by optimizing the distribution of pixel intensities
  • Histogram equalization spreads out the most frequent intensity values
  • Contrast Limited Adaptive Histogram Equalization (CLAHE) applies equalization locally
  • Gamma correction adjusts image brightness and contrast using a power-law function
  • Contrast adjustment improves feature detection and object recognition in low-contrast scenes
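
A sketch of two contrast adjustments in OpenCV, CLAHE and gamma correction via a lookup table; the clip limit, tile size, and gamma value are example settings:

```python
import cv2
import numpy as np

gray = cv2.imread("night_scene.png", cv2.IMREAD_GRAYSCALE)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)                     # locally adaptive histogram equalization

gamma = 0.5                                      # values < 1 brighten dark regions
lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
brightened = cv2.LUT(gray, lut)                  # power-law intensity mapping
```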

Noise reduction

  • Removes unwanted variations in pixel intensities caused by sensor imperfections or environmental factors
  • Gaussian filtering smooths images by convolving with a Gaussian kernel
  • Median filtering effectively removes salt-and-pepper noise while preserving edges
  • Non-local means denoising exploits self-similarity in images for high-quality results
  • Bilateral filtering preserves edges while smoothing by considering both spatial and intensity differences

Super-resolution techniques

  • Increases the resolution and quality of low-resolution images
  • Single image super-resolution uses machine learning to infer high-frequency details
  • Multi-frame super-resolution combines information from multiple low-resolution frames
  • Generative Adversarial Networks (GANs) produce realistic high-resolution images
  • Super-resolution enhances the performance of object detection and recognition tasks

Performance evaluation

  • Assesses the effectiveness and efficiency of computer vision algorithms for autonomous vehicles
  • Guides algorithm selection, optimization, and validation for real-world deployment

Accuracy metrics

  • Intersection over Union (IoU) measures the overlap between predicted and ground truth bounding boxes
  • Mean Average Precision (mAP) evaluates object detection performance across multiple classes
  • Pixel accuracy and mean Intersection over Union (mIoU) assess semantic segmentation quality
  • F1 score balances precision and recall for binary classification tasks
  • Confusion matrices provide detailed breakdowns of classification performance
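
A sketch of the IoU computation for two axis-aligned boxes in [x1, y1, x2, y2] format:

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Overlap between two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = np.maximum(box_a[:2], box_b[:2])    # top-left of the intersection
    x2, y2 = np.minimum(box_a[2:], box_b[2:])    # bottom-right of the intersection
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou(np.array([0, 0, 10, 10]), np.array([5, 5, 15, 15])))  # 25 / 175 ≈ 0.143
```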

Speed and efficiency measures

  • Frames per second (FPS) quantifies real-time processing capability
  • Floating-point operations (FLOPs) measure computational complexity
  • Memory usage and model size impact deployment on embedded systems
  • Inference time on specific hardware platforms (CPUs, GPUs, TPUs) guides algorithm selection
  • Energy efficiency becomes crucial for battery-powered autonomous vehicles

Benchmarking datasets

  • KITTI dataset provides real-world data for autonomous driving tasks
  • Cityscapes focuses on semantic understanding of urban street scenes
  • nuScenes offers multi-modal sensor data for 3D object detection and tracking
  • Waymo Open Dataset includes high-quality, diverse autonomous driving data
  • BDD100K (Berkeley DeepDrive) covers diverse driving conditions and scenarios