11.3 Multi-Armed Bandits and Reinforcement Learning

Written by the Fiveable Content Team • Last updated September 2025

Multi-armed bandits and reinforcement learning tackle the exploration-exploitation dilemma in decision-making. These techniques balance gathering new info with maximizing immediate rewards, crucial for optimizing outcomes in uncertain environments.

From epsilon-greedy to deep Q-networks, these methods power everything from A/B tests to game-playing AIs. They're key to making smart choices when you don't have all the facts, whether you're picking ads or training robots.

Exploration vs Exploitation Trade-off

Fundamental Concepts

  • Exploration-exploitation trade-off balances gathering new information and maximizing immediate rewards in sequential decision-making
  • Exploration gathers information about environment or possible actions for better future decisions
  • Exploitation maximizes immediate rewards based on current knowledge
  • Trade-off particularly relevant in scenarios with limited resources or time constraints (opportunity cost for each decision)
  • Mathematical formulations involve probability distributions and expected values of rewards for different actions
  • Applicable across various domains (machine learning, artificial intelligence, operations research, adaptive control systems)

Strategies and Considerations

  • Epsilon-greedy methods select best-known action with probability 1-ε and explore randomly with probability ε
  • Upper confidence bound algorithms maintain confidence intervals for expected reward of each arm
  • Thompson sampling uses Bayesian approach with probability distributions over expected rewards
  • Optimal balance varies depending on problem structure, time horizon, and environmental uncertainty
  • Strategies aim to minimize regret (difference between optimal and actual performance) over time
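
As a standard formalization (not spelled out above), cumulative regret after $T$ rounds is $R_T = T\mu^* - \sum_{t=1}^{T} \mathbb{E}[r_t]$, where $\mu^*$ is the mean reward of the best arm; a good strategy keeps $R_T$ growing sublinearly in $T$.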

Multi-armed Bandit Algorithms

Epsilon-Greedy Algorithm

  • Simple approach for multi-armed bandit problems
  • Maintains estimates of expected rewards for each arm
  • Updates estimates based on observed outcomes
  • Selects best-known action with probability 1-ε and explores randomly with probability ε
  • Higher ε values promote more exploration
  • Implementation involves tracking reward estimates and action counts
  • Example: In online advertising, ε-greedy could select ads with 90% exploiting the best-known performer and 10% trying new options
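
A minimal ε-greedy sketch on a simulated Bernoulli bandit; the function name, arm reward probabilities, and ε value are illustrative assumptions rather than values from the text:

```python
import random

def epsilon_greedy(true_probs, epsilon=0.1, n_steps=10_000):
    """Run epsilon-greedy on a Bernoulli bandit with the given arm probabilities."""
    n_arms = len(true_probs)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm

    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                         # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # incremental mean update of the reward estimate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# illustrative example: three "ads" with unknown click-through rates
estimates, counts = epsilon_greedy([0.05, 0.03, 0.08], epsilon=0.1)
print(estimates, counts)
```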

Upper Confidence Bound (UCB) Algorithms

  • Use optimism in the face of uncertainty to balance exploration and exploitation
  • Maintain confidence intervals for expected reward of each arm
  • Select arm with highest upper bound
  • UCB1 algorithm combines empirical mean reward with exploration bonus
  • UCB1 formula: $\text{UCB1} = \bar{X}_j + \sqrt{\frac{2\ln n}{n_j}}$
    • $\bar{X}_j$: empirical mean reward of arm j
    • $n$: total number of pulls
    • $n_j$: number of times arm j has been pulled
  • Automatically adjusts exploration based on uncertainty
  • Example: In clinical trials, UCB could guide selection of treatments, balancing known efficacy with potential of unexplored options
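
A sketch of UCB1 implementing the formula above; the arm success rates are made-up numbers for illustration, and the round index stands in for the total pull count:

```python
import math
import random

def ucb1(true_probs, n_steps=10_000):
    """UCB1 on a Bernoulli bandit: pick the arm with the highest upper confidence bound."""
    n_arms = len(true_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once so every count is nonzero
        else:
            # empirical mean plus exploration bonus sqrt(2 ln n / n_j)
            arm = max(range(n_arms),
                      key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts

print(ucb1([0.4, 0.5, 0.6]))  # most pulls should concentrate on the best arm
```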

Thompson Sampling

  • Bayesian approach for multi-armed bandit problems
  • Maintains probability distribution over expected rewards of each arm
  • Samples from these distributions to make decisions
  • Updates posterior distributions based on observed rewards
  • Naturally balances exploration and exploitation
  • Effective in practice, often outperforming simpler methods
  • Example: In A/B testing for website design, Thompson sampling could dynamically allocate traffic to different versions based on performance uncertainty
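
A Beta-Bernoulli Thompson sampling sketch in the spirit of the A/B-testing example; the conversion rates below are assumed for illustration:

```python
import random

def thompson_sampling(true_probs, n_steps=10_000):
    """Thompson sampling with Beta(1, 1) priors on Bernoulli reward rates."""
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # success counts (plus prior)
    beta = [1] * n_arms   # failure counts (plus prior)

    for _ in range(n_steps):
        # sample a plausible reward rate for each arm from its posterior
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if random.random() < true_probs[arm] else 0
        # conjugate Beta-Bernoulli posterior update
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

# two website variants with assumed conversion rates
print(thompson_sampling([0.11, 0.13]))
```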

Reinforcement Learning Techniques

Q-learning Fundamentals

  • Model-free reinforcement learning algorithm
  • Learns action-value function (Q-function) representing expected cumulative reward
  • Based on Markov Decision Process (MDP) framework
  • Q-learning update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$
    • $\alpha$: learning rate
    • $\gamma$: discount factor for future rewards
  • Iteratively updates Q-values based on observed rewards and maximum Q-value of next state
  • Handles environments with discrete state and action spaces
  • Example: Q-learning applied to game playing (Tic-Tac-Toe) learns optimal moves through repeated play
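
A minimal tabular Q-learning sketch on a toy chain environment (the environment and the α, γ, ε values are illustrative assumptions, not from the text):

```python
import random

N_STATES, GOAL = 5, 4          # chain of states 0..4; reward only for reaching state 4
ACTIONS = [-1, +1]             # move left or move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def greedy(qvals):
    # break ties randomly so the agent does not get stuck early on
    best = max(qvals)
    return random.choice([i for i, v in enumerate(qvals) if v == best])

for episode in range(1000):
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection over the Q-table
        a = random.randrange(len(ACTIONS)) if random.random() < epsilon else greedy(Q[s])
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)  # "move right" (index 1) should end up with the higher value in every state
```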

Policy Gradient Methods

  • Directly optimize policy (mapping from states to actions)
  • Use gradient ascent on expected cumulative reward
  • Useful for continuous action spaces and high-dimensional state spaces
  • REINFORCE algorithm uses Monte Carlo sampling to estimate policy gradients
  • Policy gradient theorem forms basis for many algorithms: $\nabla_\theta J(\theta) = E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) Q^{\pi_\theta}(s,a)]$
  • Can incorporate function approximation (neural networks) for complex state spaces
  • Example: Policy gradients applied to robot control tasks learn smooth, continuous actions for navigation or manipulation
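
A bare-bones REINFORCE sketch with a softmax policy on a one-step, two-action task; the learning rate and reward probabilities are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.7])   # assumed reward probability per action
theta = np.zeros(2)                 # policy parameters (one logit per action)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

baseline = 0.0
for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                      # sample an action from the policy
    r = float(rng.random() < true_probs[a])      # observe a Bernoulli reward
    # gradient of log pi(a) for a softmax policy: one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    baseline += 0.01 * (r - baseline)            # running baseline reduces variance
    theta += lr * grad_log_pi * (r - baseline)   # REINFORCE (gradient ascent) update

print(softmax(theta))  # probability mass should shift toward the better action
```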

Deep Reinforcement Learning

  • Combines RL algorithms with deep neural networks
  • Handles complex, high-dimensional state spaces (images, sensor data)
  • Deep Q-Network (DQN) uses convolutional neural networks for Q-function approximation
  • Actor-Critic methods separate policy (actor) and value function (critic) learning
  • Proximal Policy Optimization (PPO) improves stability of policy gradient methods
  • Addresses challenges of sparse rewards and long-term credit assignment
  • Example: DeepMind's AlphaGo used deep RL to master the game of Go, defeating world champions
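
A sketch of the core DQN loss computation in PyTorch; the network sizes, hyperparameters, and random transition batch are placeholders standing in for a full training loop with experience replay and periodic target updates:

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99

# online and target Q-networks (small MLPs here, rather than the convolutional nets used for images)
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# a fake minibatch of transitions (s, a, r, s', done) standing in for replay-buffer samples
batch = 32
s = torch.randn(batch, n_states)
a = torch.randint(0, n_actions, (batch,))
r = torch.randn(batch)
s_next = torch.randn(batch, n_states)
done = torch.zeros(batch)

# Q(s, a) for the actions actually taken
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
# bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at terminal states
with torch.no_grad():
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

loss = nn.functional.mse_loss(q_sa, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```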

Algorithm Performance Evaluation

Evaluation Metrics

  • Cumulative regret measures total loss compared to optimal strategy over time
  • Simple regret focuses on quality of final recommendation or decision
  • Best arm identification rate assesses ability to find optimal action
  • Average return and discounted cumulative reward evaluate overall performance in RL
  • Learning speed (sample efficiency) measures how quickly algorithms improve
  • Online performance evaluates adaptation during learning process
  • Offline performance assesses generalization after learning completes
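
A small sketch of tracking cumulative regret over a bandit run; the reward probabilities are assumed, and the uniform-random policy is a stand-in for whatever algorithm is under evaluation:

```python
import random

true_probs = [0.2, 0.5, 0.7]    # assumed arm reward probabilities
best_mean = max(true_probs)

cumulative_regret = 0.0
regret_curve = []
for t in range(1, 10_001):
    arm = random.randrange(len(true_probs))   # stand-in policy: swap in a bandit algorithm here
    # regret accrues the expected gap between the best arm and the chosen arm
    cumulative_regret += best_mean - true_probs[arm]
    regret_curve.append(cumulative_regret)

print(regret_curve[-1])  # lower is better; sublinear growth over t indicates effective learning
```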

Real-world Applications and Challenges

  • A/B testing in online advertising and recommendation systems uses multi-armed bandits
  • Reinforcement learning applied in robotics, game playing, and resource management
  • Non-stationarity means rewards or state transitions change over time, so learned estimates can become stale
  • Partial observability limits access to complete state information
  • High-dimensional state spaces require efficient function approximation
  • Safety considerations crucial in physical systems (robotics, autonomous vehicles)
  • Scalability to large state/action spaces needed for practical applications
  • Example: Recommender systems use bandits to balance exploring new content and exploiting known user preferences

Robustness and Deployment Considerations

  • Algorithms must adapt to environmental changes in real-world scenarios
  • Evaluate performance across different initial conditions and random seeds
  • Consider computational requirements for real-time decision-making
  • Assess data efficiency to minimize costly interactions with environment
  • Balance exploration and exploitation in production systems
  • Implement safeguards against unexpected or adversarial inputs
  • Continuously monitor and update models in deployed systems
  • Example: Self-driving car algorithms must robustly handle diverse traffic scenarios and weather conditions