11.3 Multi-Armed Bandits and Reinforcement Learning

Written by the Fiveable Content Team • Last updated September 2025

Multi-armed bandits and reinforcement learning tackle the exploration-exploitation dilemma in decision-making. These techniques balance gathering new info with maximizing immediate rewards, crucial for optimizing outcomes in uncertain environments.

From epsilon-greedy to deep Q-networks, these methods power everything from A/B tests to game-playing AIs. They're key to making smart choices when you don't have all the facts, whether you're picking ads or training robots.

Exploration vs Exploitation Trade-off

Fundamental Concepts

  • Exploration-exploitation trade-off balances gathering new information and maximizing immediate rewards in sequential decision-making
  • Exploration gathers information about environment or possible actions for better future decisions
  • Exploitation maximizes immediate rewards based on current knowledge
  • Trade-off particularly relevant in scenarios with limited resources or time constraints (opportunity cost for each decision)
  • Mathematical formulations involve probability distributions and expected values of rewards for different actions
  • Applicable across various domains (machine learning, artificial intelligence, operations research, adaptive control systems)

Strategies and Considerations

  • Epsilon-greedy methods select best-known action with probability 1-ε and explore randomly with probability ε
  • Upper confidence bound algorithms maintain confidence intervals for expected reward of each arm
  • Thompson sampling uses Bayesian approach with probability distributions over expected rewards
  • Optimal balance varies depending on problem structure, time horizon, and environmental uncertainty
  • Strategies aim to minimize regret (difference between optimal and actual performance) over time
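
As a standard formalization (not spelled out above), cumulative regret after $T$ rounds is $R_T = T\mu^* - \sum_{t=1}^{T} \mathbb{E}[r_t]$, where $\mu^*$ is the mean reward of the best arm; a good strategy keeps $R_T$ growing sublinearly in $T$.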

Multi-armed Bandit Algorithms

Epsilon-Greedy Algorithm

  • Simple approach for multi-armed bandit problems
  • Maintains estimates of expected rewards for each arm
  • Updates estimates based on observed outcomes
  • Selects best-known action with probability 1-ε and explores randomly with probability ε
  • Higher ε values promote more exploration
  • Implementation involves tracking reward estimates and action counts
  • Example: In online advertising, ε-greedy could select ads with 90% exploiting the best-known performer and 10% trying new options
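
A minimal ε-greedy sketch on a simulated Bernoulli bandit; the function name, arm reward probabilities, and ε value are illustrative assumptions rather than values from the text:

```python
import random

def epsilon_greedy(true_probs, epsilon=0.1, n_steps=10_000):
    """Run epsilon-greedy on a Bernoulli bandit with the given arm probabilities."""
    n_arms = len(true_probs)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm

    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                         # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # incremental mean update of the reward estimate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# illustrative example: three "ads" with unknown click-through rates
estimates, counts = epsilon_greedy([0.05, 0.03, 0.08], epsilon=0.1)
print(estimates, counts)
```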

Upper Confidence Bound (UCB) Algorithms

  • Use optimism in the face of uncertainty to balance exploration and exploitation
  • Maintain confidence intervals for expected reward of each arm
  • Select arm with highest upper bound
  • UCB1 algorithm combines empirical mean reward with exploration bonus
  • UCB1 formula: $\text{UCB1} = \bar{X}_j + \sqrt{\frac{2\ln n}{n_j}}$
    • $\bar{X}_j$: empirical mean reward of arm j
    • $n$: total number of pulls
    • $n_j$: number of times arm j has been pulled
  • Automatically adjusts exploration based on uncertainty
  • Example: In clinical trials, UCB could guide selection of treatments, balancing known efficacy with potential of unexplored options
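
A sketch of UCB1 implementing the formula above; the arm success rates are made-up numbers for illustration, and the round index stands in for the total pull count:

```python
import math
import random

def ucb1(true_probs, n_steps=10_000):
    """UCB1 on a Bernoulli bandit: pick the arm with the highest upper confidence bound."""
    n_arms = len(true_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once so every count is nonzero
        else:
            # empirical mean plus exploration bonus sqrt(2 ln n / n_j)
            arm = max(range(n_arms),
                      key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts

print(ucb1([0.4, 0.5, 0.6]))  # most pulls should concentrate on the best arm
```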

Thompson Sampling

  • Bayesian approach for multi-armed bandit problems
  • Maintains probability distribution over expected rewards of each arm
  • Samples from these distributions to make decisions
  • Updates posterior distributions based on observed rewards
  • Naturally balances exploration and exploitation
  • Effective in practice, often outperforming simpler methods
  • Example: In A/B testing for website design, Thompson sampling could dynamically allocate traffic to different versions based on performance uncertainty
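
A Beta-Bernoulli Thompson sampling sketch in the spirit of the A/B-testing example; the conversion rates below are assumed for illustration:

```python
import random

def thompson_sampling(true_probs, n_steps=10_000):
    """Thompson sampling with Beta(1, 1) priors on Bernoulli reward rates."""
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # success counts (plus prior)
    beta = [1] * n_arms   # failure counts (plus prior)

    for _ in range(n_steps):
        # sample a plausible reward rate for each arm from its posterior
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if random.random() < true_probs[arm] else 0
        # conjugate Beta-Bernoulli posterior update
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

# two website variants with assumed conversion rates
print(thompson_sampling([0.11, 0.13]))
```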

Reinforcement Learning Techniques

Q-learning Fundamentals

  • Model-free reinforcement learning algorithm
  • Learns action-value function (Q-function) representing expected cumulative reward
  • Based on Markov Decision Process (MDP) framework
  • Q-learning update rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$
    • $\alpha$: learning rate
    • $\gamma$: discount factor for future rewards
  • Iteratively updates Q-values based on observed rewards and maximum Q-value of next state
  • Handles environments with discrete state and action spaces
  • Example: Q-learning applied to game playing (Tic-Tac-Toe) learns optimal moves through repeated play
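
A minimal tabular Q-learning sketch on a toy chain environment (the environment and the α, γ, ε values are illustrative assumptions, not from the text):

```python
import random

N_STATES, GOAL = 5, 4          # chain of states 0..4; reward only for reaching state 4
ACTIONS = [-1, +1]             # move left or move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def greedy(qvals):
    # break ties randomly so the agent does not get stuck early on
    best = max(qvals)
    return random.choice([i for i, v in enumerate(qvals) if v == best])

for episode in range(1000):
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection over the Q-table
        a = random.randrange(len(ACTIONS)) if random.random() < epsilon else greedy(Q[s])
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)  # "move right" (index 1) should end up with the higher value in every state
```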

Policy Gradient Methods

  • Directly optimize policy (mapping from states to actions)
  • Use gradient ascent on expected cumulative reward
  • Useful for continuous action spaces and high-dimensional state spaces
  • REINFORCE algorithm uses Monte Carlo sampling to estimate policy gradients
  • Policy gradient theorem forms basis for many algorithms: $\nabla_\theta J(\theta) = E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) Q^{\pi_\theta}(s,a)]$
  • Can incorporate function approximation (neural networks) for complex state spaces
  • Example: Policy gradients applied to robot control tasks learn smooth, continuous actions for navigation or manipulation
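
A bare-bones REINFORCE sketch with a softmax policy on a one-step, two-action task; the learning rate and reward probabilities are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.7])   # assumed reward probability per action
theta = np.zeros(2)                 # policy parameters (one logit per action)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

baseline = 0.0
for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                      # sample an action from the policy
    r = float(rng.random() < true_probs[a])      # observe a Bernoulli reward
    # gradient of log pi(a) for a softmax policy: one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    baseline += 0.01 * (r - baseline)            # running baseline reduces variance
    theta += lr * grad_log_pi * (r - baseline)   # REINFORCE (gradient ascent) update

print(softmax(theta))  # probability mass should shift toward the better action
```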

Deep Reinforcement Learning

  • Combines RL algorithms with deep neural networks
  • Handles complex, high-dimensional state spaces (images, sensor data)
  • Deep Q-Network (DQN) uses convolutional neural networks for Q-function approximation
  • Actor-Critic methods separate policy (actor) and value function (critic) learning
  • Proximal Policy Optimization (PPO) improves stability of policy gradient methods
  • Addresses challenges of sparse rewards and long-term credit assignment
  • Example: DeepMind's AlphaGo used deep RL to master the game of Go, defeating world champions
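
A sketch of the core DQN loss computation in PyTorch; the network sizes, hyperparameters, and random transition batch are placeholders standing in for a full training loop with experience replay and periodic target updates:

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99

# online and target Q-networks (small MLPs here, rather than the convolutional nets used for images)
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# a fake minibatch of transitions (s, a, r, s', done) standing in for replay-buffer samples
batch = 32
s = torch.randn(batch, n_states)
a = torch.randint(0, n_actions, (batch,))
r = torch.randn(batch)
s_next = torch.randn(batch, n_states)
done = torch.zeros(batch)

# Q(s, a) for the actions actually taken
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
# bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at terminal states
with torch.no_grad():
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

loss = nn.functional.mse_loss(q_sa, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```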

Algorithm Performance Evaluation

Evaluation Metrics

  • Cumulative regret measures total loss compared to optimal strategy over time
  • Simple regret focuses on quality of final recommendation or decision
  • Best arm identification rate assesses ability to find optimal action
  • Average return and discounted cumulative reward evaluate overall performance in RL
  • Learning speed (sample efficiency) measures how quickly algorithms improve
  • Online performance evaluates adaptation during learning process
  • Offline performance assesses generalization after learning completes
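
A small sketch of tracking cumulative regret over a bandit run; the reward probabilities are assumed, and the uniform-random policy is a stand-in for whatever algorithm is under evaluation:

```python
import random

true_probs = [0.2, 0.5, 0.7]    # assumed arm reward probabilities
best_mean = max(true_probs)

cumulative_regret = 0.0
regret_curve = []
for t in range(1, 10_001):
    arm = random.randrange(len(true_probs))   # stand-in policy: swap in a bandit algorithm here
    # regret accrues the expected gap between the best arm and the chosen arm
    cumulative_regret += best_mean - true_probs[arm]
    regret_curve.append(cumulative_regret)

print(regret_curve[-1])  # lower is better; sublinear growth over t indicates effective learning
```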

Real-world Applications and Challenges

  • A/B testing in online advertising and recommendation systems uses multi-armed bandits
  • Reinforcement learning applied in robotics, game playing, and resource management
  • Non-stationarity means rewards or state transitions change over time, so learned estimates can become stale
  • Partial observability limits access to complete state information
  • High-dimensional state spaces require efficient function approximation
  • Safety considerations crucial in physical systems (robotics, autonomous vehicles)
  • Scalability to large state/action spaces needed for practical applications
  • Example: Recommender systems use bandits to balance exploring new content and exploiting known user preferences

Robustness and Deployment Considerations

  • Algorithms must adapt to environmental changes in real-world scenarios
  • Evaluate performance across different initial conditions and random seeds
  • Consider computational requirements for real-time decision-making
  • Assess data efficiency to minimize costly interactions with environment
  • Balance exploration and exploitation in production systems
  • Implement safeguards against unexpected or adversarial inputs
  • Continuously monitor and update models in deployed systems
  • Example: Self-driving car algorithms must robustly handle diverse traffic scenarios and weather conditions