## Algorithms for Decision Making

by Kochenderfer, Wheeler, Wray

### Instructor Requests

A broad introduction to algorithms for decision making under uncertainty, introducing the underlying mathematical problem formulations and the algorithms for solving them.

Automated decision-making systems or decision-support systems—used in applications that range from aircraft collision avoidance to breast cancer screening—must be designed to account for various sources of uncertainty while carefully balancing multiple objectives. This textbook provides a broad introduction to algorithms for decision making under uncertainty, covering the underlying mathematical problem formulations and the algorithms for solving them.

The book first addresses the problem of reasoning about uncertainty and objectives in simple decisions at a single point in time, and then turns to sequential decision problems in stochastic environments where the outcomes of our actions are uncertain. It goes on to address model uncertainty, when we do not start with a known model and must learn how to act through interaction with the environment; state uncertainty, in which we do not know the current state of the environment due to imperfect perceptual information; and decision contexts involving multiple agents. The book focuses primarily on planning and reinforcement learning, although some of the techniques presented draw on elements of supervised learning and optimization. Algorithms are implemented in the Julia programming language. Figures, examples, and exercises convey the intuition behind the various approaches presented.

Expand/Collapse All
Contents (pg. vii)
Preface (pg. xix)
Acknowledgments (pg. xxi)
1. Introduction (pg. 1)
1.1 Decision Making (pg. 1)
1.2 Applications (pg. 2)
1.3 Methods (pg. 5)
1.4 History (pg. 7)
1.5 Societal Impact (pg. 12)
1.6 Overview (pg. 14)
Part I. Probabilistic Reasoning (pg. 17)
2. Representation (pg. 19)
2.1 Degrees of Belief and Probability (pg. 19)
2.2 Probability Distributions (pg. 20)
2.3 Joint Distributions (pg. 24)
2.4 Conditional Distributions (pg. 29)
2.5 Bayesian Networks (pg. 32)
2.6 Conditional Independence (pg. 35)
2.7 Summary (pg. 36)
2.8 Exercises (pg. 38)
3. Inference (pg. 43)
3.1 Inference in Bayesian Networks (pg. 43)
3.2 Inference in Naive Bayes Models (pg. 48)
3.3 Sum-Product Variable Elimination (pg. 49)
3.4 Belief Propagation (pg. 53)
3.5 Computational Complexity (pg. 53)
3.6 Direct Sampling (pg. 54)
3.7 Likelihood Weighted Sampling (pg. 57)
3.8 Gibbs Sampling (pg. 60)
3.9 Inference in Gaussian Models (pg. 63)
3.10 Summary (pg. 65)
3.11 Exercises (pg. 66)
4. Parameter Learning (pg. 71)
4.1 Maximum Likelihood Parameter Learning (pg. 71)
4.2 Bayesian Parameter Learning (pg. 75)
4.3 Nonparametric Learning (pg. 82)
4.4 Learning with Missing Data (pg. 82)
4.5 Summary (pg. 89)
4.6 Exercises (pg. 89)
5. Structure Learning (pg. 97)
5.1 Bayesian Network Scoring (pg. 97)
5.2 Directed Graph Search (pg. 99)
5.3 Markov Equivalence Classes (pg. 103)
5.4 Partially Directed Graph Search (pg. 104)
5.5 Summary (pg. 106)
5.6 Exercises (pg. 107)
6. Simple Decisions (pg. 111)
6.1 Constraints on Rational Preferences (pg. 111)
6.2 Utility Functions (pg. 112)
6.3 Utility Elicitation (pg. 114)
6.4 Maximum Expected Utility Principle (pg. 116)
6.5 Decision Networks (pg. 116)
6.6 Value of Information (pg. 119)
6.7 Irrationality (pg. 122)
6.8 Summary (pg. 125)
6.9 Exercises (pg. 125)
Part II. Sequential Problems (pg. 131)
7. Exact Solution Methods (pg. 133)
7.1 Markov Decision Processes (pg. 133)
7.2 Policy Evaluation (pg. 136)
7.3 Value Function Policies (pg. 139)
7.4 Policy Iteration (pg. 140)
7.5 Value Iteration (pg. 141)
7.6 Asynchronous Value Iteration (pg. 145)
7.7 Linear Program Formulation (pg. 147)
7.8 Linear Systems with Quadratic Reward (pg. 147)
7.9 Summary (pg. 150)
7.10 Exercises (pg. 151)
8. Approximate Value Functions (pg. 161)
8.1 Parametric Representations (pg. 161)
8.2 Nearest Neighbor (pg. 163)
8.3 Kernel Smoothing (pg. 164)
8.4 Linear Interpolation (pg. 167)
8.5 Simplex Interpolation (pg. 168)
8.6 Linear Regression (pg. 172)
8.7 Neural Network Regression (pg. 174)
8.8 Summary (pg. 175)
8.9 Exercises (pg. 177)
9. Online Planning (pg. 181)
9.1 Receding Horizon Planning (pg. 181)
9.2 Lookahead with Rollouts (pg. 183)
9.3 Forward Search (pg. 183)
9.4 Branch and Bound (pg. 185)
9.5 Sparse Sampling (pg. 187)
9.6 Monte Carlo Tree Search (pg. 187)
9.7 Heuristic Search (pg. 197)
9.8 Labeled Heuristic Search (pg. 197)
9.9 Open-Loop Planning (pg. 200)
9.10 Summary (pg. 208)
9.11 Exercises (pg. 209)
10. Policy Search (pg. 213)
10.1 Approximate Policy Evaluation (pg. 213)
10.2 Local Search (pg. 215)
10.3 Genetic Algorithms (pg. 215)
10.4 Cross Entropy Method (pg. 218)
10.5 Evolution Strategies (pg. 219)
10.6 Isotropic Evolutionary Strategies (pg. 224)
10.7 Summary (pg. 226)
10.8 Exercises (pg. 226)
11. Policy Gradient Estimation (pg. 231)
11.1 Finite Difference (pg. 231)
11.3 Likelihood Ratio (pg. 234)
11.4 Reward-to-Go (pg. 237)
11.5 Baseline Subtraction (pg. 241)
11.6 Summary (pg. 245)
11.7 Exercises (pg. 246)
12. Policy Gradient Optimization (pg. 249)
12.1 Gradient Ascent Update (pg. 249)
12.2 Restricted Gradient Update (pg. 251)
12.3 Natural Gradient Update (pg. 253)
12.4 Trust Region Update (pg. 254)
12.5 Clamped Surrogate Objective (pg. 257)
12.6 Summary (pg. 263)
12.7 Exercises (pg. 264)
13. Actor-Critic Methods (pg. 267)
13.1 Actor-Critic (pg. 267)
13.2 Generalized Advantage Estimation (pg. 269)
13.3 Deterministic Policy Gradient (pg. 272)
13.4 Actor-Critic with Monte Carlo Tree Search (pg. 274)
13.5 Summary (pg. 277)
13.6 Exercises (pg. 277)
14. Policy Validation (pg. 281)
14.1 Performance Metric Evaluation (pg. 281)
14.2 Rare Event Simulation (pg. 285)
14.3 Robustness Analysis (pg. 288)
14.6 Summary (pg. 295)
14.7 Exercises (pg. 295)
Part III. Model Uncertainty (pg. 297)
15. Exploration and Exploitation (pg. 299)
15.1 Bandit Problems (pg. 299)
15.2 Bayesian Model Estimation (pg. 301)
15.3 Undirected Exploration Strategies (pg. 301)
15.4 Directed Exploration Strategies (pg. 303)
15.5 Optimal Exploration Strategies (pg. 306)
15.6 Exploration with Multiple States (pg. 309)
15.7 Summary (pg. 309)
15.8 Exercises (pg. 311)
16. Model-Based Methods (pg. 317)
16.1 Maximum Likelihood Models (pg. 317)
16.2 Update Schemes (pg. 318)
16.3 Exploration (pg. 321)
16.4 Bayesian Methods (pg. 326)
16.5 Bayes-Adaptive Markov Decision Processes (pg. 329)
16.6 Posterior Sampling (pg. 330)
16.7 Summary (pg. 332)
16.8 Exercises (pg. 332)
17. Model-Free Methods (pg. 335)
17.1 Incremental Estimation of the Mean (pg. 335)
17.2 Q-Learning (pg. 336)
17.3 Sarsa (pg. 338)
17.4 Eligibility Traces (pg. 341)
17.5 Reward Shaping (pg. 343)
17.6 Action Value Function Approximation (pg. 343)
17.7 Experience Replay (pg. 345)
17.8 Summary (pg. 348)
17.9 Exercises (pg. 351)
18. Imitation Learning (pg. 355)
18.1 Behavioral Cloning (pg. 355)
18.2 Data Set Aggregation (pg. 358)
18.3 Stochastic Mixing Iterative Learning (pg. 358)
18.4 Maximum Margin Inverse Reinforcement Learning (pg. 361)
18.5 Maximum Entropy Inverse Reinforcement Learning (pg. 365)
18.6 Generative Adversarial Imitation Learning (pg. 369)
18.7 Summary (pg. 371)
18.8 Exercises (pg. 372)
Part IV. State Uncertainty (pg. 377)
19. Beliefs (pg. 379)
19.1 Belief Initialization (pg. 379)
19.2 Discrete State Filter (pg. 380)
19.3 Kalman Filter (pg. 383)
19.4 Extended Kalman Filter (pg. 385)
19.5 Unscented Kalman Filter (pg. 387)
19.6 Particle Filter (pg. 390)
19.7 Particle Injection (pg. 394)
19.8 Summary (pg. 395)
19.9 Exercises (pg. 397)
20. Exact Belief State Planning (pg. 407)
20.1 Belief-State Markov Decision Processes (pg. 407)
20.2 Conditional Plans (pg. 408)
20.3 Alpha Vectors (pg. 411)
20.4 Pruning (pg. 412)
20.5 Value Iteration (pg. 416)
20.6 Linear Policies (pg. 419)
20.7 Summary (pg. 419)
20.8 Exercises (pg. 422)
21. Offline Belief State Planning (pg. 427)
21.1 Fully Observable Value Approximation (pg. 427)
21.2 Fast Informed Bound (pg. 429)
21.3 Fast Lower Bounds (pg. 430)
21.4 Point-Based Value Iteration (pg. 431)
21.5 Randomized Point-Based Value Iteration (pg. 433)
21.6 Sawtooth Upper Bound (pg. 436)
21.7 Point Selection (pg. 440)
21.8 Sawtooth Heuristic Search (pg. 442)
21.9 Triangulated Value Functions (pg. 445)
21.10 Summary (pg. 447)
21.11 Exercises (pg. 448)
22. Online Belief State Planning (pg. 453)
22.1 Lookahead with Rollouts (pg. 453)
22.2 Forward Search (pg. 453)
22.3 Branch and Bound (pg. 456)
22.4 Sparse Sampling (pg. 456)
22.5 Monte Carlo Tree Search (pg. 457)
22.6 Determinized Sparse Tree Search (pg. 459)
22.7 Gap Heuristic Search (pg. 460)
22.8 Summary (pg. 464)
22.9 Exercises (pg. 467)
23. Controller Abstractions (pg. 471)
23.1 Controllers (pg. 471)
23.2 Policy Iteration (pg. 475)
23.3 Nonlinear Programming (pg. 478)
23.5 Summary (pg. 486)
23.6 Exercises (pg. 486)
Part V. Multiagent Systems (pg. 491)
24. Multiagent Reasoning (pg. 493)
24.1 Simple Games (pg. 493)
24.2 Response Models (pg. 494)
24.3 Dominant Strategy Equilibrium (pg. 497)
24.4 Nash Equilibrium (pg. 498)
24.5 Correlated Equilibrium (pg. 498)
24.6 Iterated Best Response (pg. 503)
24.7 Hierarchical Softmax (pg. 504)
24.8 Fictitious Play (pg. 505)
24.10 Summary (pg. 509)
24.11 Exercises (pg. 511)
25. Sequential Problems (pg. 517)
25.1 Markov Games (pg. 517)
25.2 Response Models (pg. 519)
25.3 Nash Equilibrium (pg. 520)
25.4 Fictitious Play (pg. 521)
25.6 Nash Q-Learning (pg. 526)
25.7 Summary (pg. 528)
25.8 Exercises (pg. 530)
26. State Uncertainty (pg. 533)
26.1 Partially Observable Markov Games (pg. 533)
26.2 Policy Evaluation (pg. 535)
26.3 Nash Equilibrium (pg. 537)
26.4 Dynamic Programming (pg. 540)
26.5 Summary (pg. 542)
26.6 Exercises (pg. 542)
27. Collaborative Agents (pg. 545)
27.1 Decentralized Partially Observable Markov Decision Processes (pg. 545)
27.2 Subclasses (pg. 546)
27.3 Dynamic Programming (pg. 549)
27.4 Iterated Best Response (pg. 550)
27.5 Heuristic Search (pg. 550)
27.6 Nonlinear Programming (pg. 551)
27.7 Summary (pg. 554)
27.8 Exercises (pg. 556)
Appendices (pg. 559)
A. Mathematical Concepts (pg. 561)
A.1 Measure Spaces (pg. 561)
A.2 Probability Spaces (pg. 562)
A.3 Metric Spaces (pg. 562)
A.4 Normed Vector Spaces (pg. 562)
A.5 Positive Definiteness (pg. 564)
A.6 Convexity (pg. 564)
A.7 Information Content (pg. 565)
A.8 Entropy (pg. 566)
A.9 Cross Entropy (pg. 566)
A.10 Relative Entropy (pg. 567)
A.12 Taylor Expansion (pg. 568)
A.13 Monte Carlo Estimation (pg. 569)
A.14 Importance Sampling (pg. 570)
A.15 Contraction Mappings (pg. 570)
A.16 Graphs (pg. 572)
B. Probability Distributions (pg. 573)
C. Computational Complexity (pg. 575)
C.1 Asymptotic Notation (pg. 575)
C.2 Time Complexity Classes (pg. 577)
C.3 Space Complexity Classes (pg. 577)
C.4 Decidability (pg. 579)
D. Neural Representations (pg. 581)
D.1 Neural Networks (pg. 581)
D.2 Feedforward Networks (pg. 582)
D.3 Parameter Regularization (pg. 585)
D.4 Convolutional Neural Networks (pg. 587)
D.5 Recurrent Networks (pg. 588)
D.6 Autoencoder Networks (pg. 592)
E. Search Algorithms (pg. 599)
E.1 Search Problems (pg. 599)
E.2 Search Graphs (pg. 600)
E.3 Forward Search (pg. 600)
E.4 Branch and Bound (pg. 601)
E.5 Dynamic Programming (pg. 604)
E.6 Heuristic Search (pg. 604)
F. Problems (pg. 609)
F.1 Hex World (pg. 609)
F.2 2048 (pg. 610)
F.3 Cart-Pole (pg. 611)
F.4 Mountain Car (pg. 612)
F.5 Simple Regulator (pg. 613)
F.6 Aircraft Collision Avoidance (pg. 614)
F.7 Crying Baby (pg. 615)
F.8 Machine Replacement (pg. 617)
F.9 Catch (pg. 619)
F.10 Prisoner’s Dilemma (pg. 621)
F.11 Rock-Paper-Scissors (pg. 621)
F.12 Traveler’s Dilemma (pg. 622)
F.13 Predator-Prey Hex World (pg. 623)
F.14 Multicaregiver Crying Baby (pg. 624)
F.15 Collaborative Predator-Prey Hex World (pg. 625)
G. Julia (pg. 627)
G.1 Types (pg. 627)
G.2 Functions (pg. 640)
G.3 Control Flow (pg. 643)
G.4 Packages (pg. 645)
G.5 Convenience Functions (pg. 648)
References (pg. 651)
Index (pg. 671)

#### Mykel J. Kochenderfer

Mykel Kochenderfer is Associate Professor at Stanford University, where he is Director of the Stanford Intelligent Systems Laboratory (SISL). He is the author of Decision Making Under Uncertainty (MIT Press). Kochenderfer and Tim Wheeler are coauthors of Algorithms for Optimization (MIT Press).

#### Tim A. Wheeler

Tim Wheeler is a software engineer in the Bay Area, working on autonomy, controls, and decision-making systems. Wheeler and Mykel Kochenderfer are coauthors of Algorithms for Optimization (MIT Press).

#### Kyle H. Wray

Kyle Wray is a researcher who designs and implements the decision-making systems on real-world robots.

eTextbook