Foundations of Machine Learning
by Rostamizadeh, Talwalkar, Mohri
ISBN: 9780262364126  Copyright 2020
Instructor Requests
A new edition of a graduatelevel machine learning textbook that focuses on the analysis and theory of algorithms.
This book is a general introduction to machine learning that can serve as a textbook for graduate students and a reference for researchers. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics.
Foundations of Machine Learning is unique in its focus on the analysis and theory of algorithms. The first four chapters lay the theoretical foundation for what follows; subsequent chapters are mostly selfcontained. Topics covered include the Probably Approximately Correct (PAC) learning framework; generalization bounds based on Rademacher complexity and VCdimension; Support Vector Machines (SVM); kernel methods; boosting; online learning; multiclass classification; ranking; regression; algorithmic stability; dimensionality reduction; learning automata and languages; and reinforcement learning. Each chapter ends with a set of exercises. Appendixes aoffer dditional material including concise probability review.
This second edition offers three new chapters, on model selection, maximum entropy models, and conditional entropy models. New material in the appendixes includes a major section on Fenchel duality, expanded coverage of concentration inequalities, and an entirely new entry on information theory. More than half of the exercises are new to this edition.
Expand/Collapse All  

Contents (pg. v)  
Preface (pg. xiii)  
1. Introduction (pg. 1)  
1.1 What is machine learning? (pg. 1)  
1.2 What kind of problems can be tackled using machine learning? (pg. 2)  
1.3 Some standard learning tasks (pg. 3)  
1.4 Learning stages (pg. 4)  
1.5 Learning scenarios (pg. 6)  
1.6 Generalization (pg. 7)  
2. The PAC Learning Framework (pg. 9)  
2.1 The PAC learning model (pg. 9)  
2.2 Guarantees for finite hypothesis sets — consistent case (pg. 15)  
2.3 Guarantees for finite hypothesis sets — inconsistent case&# (pg. 19)  
2.4 Generalities (pg. 21)  
2.4.1 Deterministic versus stochastic scenarios (pg. 21)  
2.4.2 Bayes error and noise (pg. 22)  
2.5 Chapter notes (pg. 23)  
2.6 Exercises (pg. 23)  
3. Rademacher Complexity and VCDimension (pg. 29)  
3.1 Rademacher complexity (pg. 30)  
3.2 Growth function (pg. 34)  
3.3 VCdimension (pg. 36)  
3.4 Lower bounds (pg. 43)  
3.5 Chapter notes (pg. 48)  
3.6 Exercises (pg. 50)  
4. Model Selection (pg. 61)  
4.1 Estimation and approximation errors (pg. 61)  
4.2 Empirical risk minimization (ERM) (pg. 62)  
4.3 Structural risk minimization (SRM) (pg. 64)  
4.4 Crossvalidation (pg. 68)  
4.5 nFold crossvalidation (pg. 71)  
4.6 Regularizationbased algorithms (pg. 72)  
4.7 Convex surrogate losses (pg. 73)  
4.8 Chapter notes (pg. 77)  
4.9 Exercises (pg. 78)  
5. Support Vector Machines (pg. 79)  
5.1 Linear classification (pg. 79)  
5.2 Separable case (pg. 80)  
5.2.1 Primal optimization problem (pg. 81)  
5.2.2 Support vectors (pg. 83)  
5.2.3 Dual optimization problem (pg. 83)  
5.2.4 Leaveoneout analysis (pg. 85)  
5.3 Nonseparable case (pg. 87)  
5.3.1 Primal optimization problem (pg. 88)  
5.3.2 Support vectors (pg. 89)  
5.3.3 Dual optimization problem (pg. 90)  
5.4 Margin theory (pg. 91)  
5.5 Chapter notes (pg. 100)  
5.6 Exercises (pg. 100)  
6. Kernel Methods (pg. 105)  
6.1 Introduction (pg. 105)  
6.2 Positive definite symmetric kernels (pg. 108)  
6.2.1 Definitions (pg. 108)  
6.2.2 Reproducing kernel Hilbert space (pg. 110)  
6.2.3 Properties (pg. 112)  
6.3 Kernelbased algorithms (pg. 116)  
6.3.1 SVMs with PDS kernels (pg. 116)  
6.3.2 Representer theorem (pg. 117)  
6.3.3 Learning guarantees (pg. 117)  
6.4 Negative definite symmetric kernels (pg. 119)  
6.5 Sequence kernels (pg. 121)  
6.5.1 Weighted transducers (pg. 122)  
6.5.2 Rational kernels (pg. 126)  
6.6 Approximate kernel feature maps (pg. 130)  
6.7 Chapter notes (pg. 135)  
6.8 Exercises (pg. 137)  
7. Boosting (pg. 145)  
7.1 Introduction (pg. 145)  
7.2 AdaBoost (pg. 146)  
7.2.1 Bound on the empirical error (pg. 149)  
7.2.2 Relationship with coordinate descent (pg. 150)  
7.2.3 Practical use (pg. 154)  
7.3 Theoretical results (pg. 154)  
7.3.1 VCdimensionbased analysis (pg. 154)  
7.3.2 L1geometric margin (pg. 155)  
7.3.3 Marginbased analysis (pg. 157)  
7.3.4 Margin maximization (pg. 161)  
7.3.5 Gametheoretic interpretation (pg. 162)  
7.4 L1regularization (pg. 165)  
7.5 Discussion (pg. 167)  
7.6 Chapter notes (pg. 168)  
7.7 Exercises (pg. 170)  
8. OnLine Learning (pg. 177)  
8.1 Introduction (pg. 178)  
8.2 Prediction with expert advice (pg. 178)  
8.2.1 Mistake bounds and Halving algorithm (pg. 179)  
8.2.2 Weighted majority algorithm (pg. 181)  
8.2.3 Randomized weighted majority algorithm (pg. 183)  
8.2.4 Exponential weighted average algorithm (pg. 186)  
8.3 Linear classification (pg. 190)  
8.3.1 Perceptron algorithm (pg. 190)  
8.3.2 Winnow algorithm (pg. 198)  
8.4 Online to batch conversion (pg. 201)  
8.5 Gametheoretic connection (pg. 204)  
8.6 Chapter notes (pg. 205)  
8.7 Exercises (pg. 206)  
9. MultiClass Classification (pg. 213)  
9.1 Multiclass classification problem (pg. 213)  
9.2 Generalization bounds (pg. 215)  
9.3 Uncombined multiclass algorithms (pg. 221)  
9.3.1 Multiclass SVMs (pg. 221)  
9.3.2 Multiclass boosting algorithms (pg. 222)  
9.3.3 Decision trees (pg. 224)  
9.4 Aggregated multiclass algorithms (pg. 228)  
9.4.1 Oneversusall (pg. 229)  
9.4.2 Oneversusone (pg. 229)  
9.4.3 Errorcorrecting output codes (pg. 231)  
9.5 Structured prediction algorithms (pg. 233)  
9.6 Chapter notes (pg. 235)  
9.7 Exercises (pg. 237)  
10. Ranking (pg. 239)  
10.1 The problem of ranking (pg. 240)  
10.2 Generalization bound (pg. 241)  
10.3 Ranking with SVMs (pg. 243)  
10.4 RankBoost (pg. 244)  
10.4.1 Bound on the empirical error (pg. 246)  
10.4.2 Relationship with coordinate descent (pg. 248)  
10.4.3 Margin bound for ensemble methods in ranking (pg. 250)  
10.5 Bipartite ranking (pg. 251)  
10.5.1 Boosting in bipartite ranking (pg. 252)  
10.5.2 Area under the ROC curve (pg. 255)  
10.6 Preferencebased setting (pg. 257)  
10.6.1 Secondstage ranking problem (pg. 257)  
10.6.2 Deterministic algorithm (pg. 259)  
10.6.3 Randomized algorithm (pg. 260)  
10.6.4 Extension to other loss functions (pg. 262)  
10.7 Other ranking criteria (pg. 262)  
10.8 Chapter notes (pg. 263)  
10.9 Exercises (pg. 264)  
11. Regression (pg. 267)  
11.1 The problem of regression (pg. 267)  
11.2 Generalization bounds (pg. 268)  
11.2.1 Finite hypothesis sets (pg. 268)  
11.2.2 Rademacher complexity bounds (pg. 269)  
11.2.3 Pseudodimension bounds (pg. 271)  
11.3 Regression algorithms (pg. 275)  
11.3.1 Linear regression (pg. 275)  
11.3.2 Kernel ridge regression (pg. 276)  
11.3.3 Support vector regression (pg. 281)  
11.3.4 Lasso (pg. 285)  
11.3.5 Group norm regression algorithms (pg. 289)  
11.3.6 Online regression algorithms (pg. 289)  
11.4 Chapter notes (pg. 290)  
11.5 Exercises (pg. 292)  
12. Maximum Entropy Models (pg. 295)  
12.1 Density estimation problem (pg. 295)  
12.1.1 Maximum Likelihood (ML) solution (pg. 296)  
12.1.2 Maximum a Posteriori (MAP) solution (pg. 297)  
12.2 Density estimation problem augmented with features (pg. 297)  
12.3 Maxent principle (pg. 298)  
12.4 Maxent models (pg. 299)  
12.5 Dual problem (pg. 299)  
12.6 Generalization bound (pg. 303)  
12.7 Coordinate descent algorithm (pg. 304)  
12.8 Extensions (pg. 306)  
12.9 L2regularization (pg. 308)  
12.10 Chapter notes (pg. 312)  
12.11 Exercises (pg. 313)  
13. Conditional Maximum Entropy Models (pg. 315)  
13.1 Learning problem (pg. 315)  
13.2 Conditional Maxent principle (pg. 316)  
13.3 Conditional Maxent models (pg. 316)  
13.4 Dual problem (pg. 317)  
13.5 Properties (pg. 319)  
13.5.1 Optimization problem (pg. 320)  
13.5.2 Feature vectors (pg. 320)  
13.5.3 Prediction (pg. 321)  
13.6 Generalization bounds (pg. 321)  
13.7 Logistic regression (pg. 325)  
13.7.1 Optimization problem (pg. 325)  
13.7.2 Logistic model (pg. 325)  
13.8 L2regularization (pg. 326)  
13.9 Proof of the duality theorem (pg. 328)  
13.10 Chapter notes (pg. 330)  
13.11 Exercises (pg. 331)  
14. Algorithmic Stability (pg. 333)  
14.1 Definitions (pg. 333)  
14.2 Stabilitybased generalization guarantee (pg. 334)  
14.3 Stability of kernelbased regularization algorithms (pg. 336)  
14.3.1 Application to regression algorithms: SVR and KRR (pg. 339)  
14.3.2 Application to classification algorithms: SVMs (pg. 341)  
14.3.3 Discussion (pg. 342)  
14.4 Chapter notes (pg. 342)  
14.5 Exercises (pg. 343)  
15. Dimensionality Reduction (pg. 347)  
15.1 Principal component analysis (pg. 348)  
15.2 Kernel principal component analysis (KPCA) (pg. 349)  
15.3 KPCA and manifold learning (pg. 351)  
15.3.1 Isomap (pg. 351)  
15.3.2 Laplacian eigenmaps (pg. 352)  
15.3.3 Locally linear embedding (LLE) (pg. 353)  
15.4 JohnsonLindenstrauss lemma (pg. 354)  
15.5 Chapter notes (pg. 356)  
15.6 Exercises (pg. 356)  
16. Learning Automata and Languages (pg. 359)  
16.1 Introduction (pg. 359)  
16.2 Finite automata (pg. 360)  
16.3 Efficient exact learning (pg. 361)  
16.3.1 Passive learning (pg. 362)  
16.3.2 Learning with queries (pg. 363)  
16.3.3 Learning automata with queries (pg. 364)  
16.4 Identification in the limit (pg. 369)  
16.4.1 Learning reversible automata (pg. 370)  
16.5 Chapter notes (pg. 375)  
16.6 Exercises (pg. 376)  
17. Reinforcement Learning (pg. 379)  
17.1 Learning scenario (pg. 379)  
17.2 Markov decision process model (pg. 380)  
17.3 Policy (pg. 381)  
17.3.1 Definition (pg. 381)  
17.3.2 Policy value (pg. 382)  
17.3.3 Optimal policies (pg. 382)  
17.3.4 Policy evaluation (pg. 385)  
17.4 Planning algorithms (pg. 387)  
17.4.1 Value iteration (pg. 387)  
17.4.2 Policy iteration (pg. 390)  
17.4.3 Linear programming (pg. 392)  
17.5 Learning algorithms (pg. 393)  
17.5.1 Stochastic approximation (pg. 394)  
17.5.2 TD(0) algorithm (pg. 397)  
17.5.3 Qlearning algorithm (pg. 398)  
17.5.4 SARSA (pg. 402)  
17.5.5 TD(λ) algorithm (pg. 402)  
17.5.6 Large state space (pg. 403)  
17.6 Chapter notes (pg. 405)  
Conclusion (pg. 407)  
A. Linear Algebra Review (pg. 409)  
A.1 Vectors and norms (pg. 409)  
A.1.1 Norms (pg. 409)  
A.1.2 Dual norms (pg. 410)  
A.1.3 Relationship between norms (pg. 411)  
A.2 Matrices (pg. 411)  
A.2.1 Matrix norms (pg. 411)  
A.2.2 Singular value decomposition (pg. 412)  
A.2.3 Symmetric positive semidefinite (SPSD) matrices (pg. 412)  
B. Convex Optimization (pg. 415)  
B.1 Differentiation and unconstrained optimization (pg. 415)  
B.2 Convexity (pg. 415)  
B.3 Constrained optimization (pg. 419)  
B.4 Fenchel duality (pg. 422)  
B.4.1 Subgradients (pg. 422)  
B.4.2 Core (pg. 423)  
B.4.3 Conjugate functions (pg. 423)  
B.5 Chapter notes (pg. 426)  
B.6 Exercises (pg. 427)  
C. Probability Review (pg. 429)  
C.1 Probability (pg. 429)  
C.2 Random variables (pg. 429)  
C.3 Conditional probability and independence (pg. 431)  
C.4 Expectation and Markov’s inequality (pg. 431)  
C.5 Variance and Chebyshev’s inequality (pg. 432)  
C.6 Momentgenerating functions (pg. 434)  
C.7 Exercises (pg. 435)  
D. Concentration Inequalities (pg. 437)  
D.1 Hoeffding’s inequality (pg. 437)  
D.2 Sanov’s theorem (pg. 438)  
D.3 Multiplicative Chernoff bounds (pg. 439)  
D.4 Binomial distribution tails: Upper bounds (pg. 440)  
D.5 Binomial distribution tails: Lower bound (pg. 440)  
D.6 Azuma’s inequality (pg. 441)  
D.7 McDiarmid’s inequality (pg. 442)  
D.8 Normal distribution tails: Lower bound (pg. 443)  
D.9 KhintchineKahane inequality (pg. 443)  
D.10 Maximal inequality (pg. 444)  
D.11 Chapter notes (pg. 445)  
D.12 Exercises (pg. 445)  
E. Notions of Information Theory (pg. 449)  
E.1 Entropy (pg. 449)  
E.2 Relative entropy (pg. 450)  
E.3 Mutual information (pg. 453)  
E.4 Bregman divergences (pg. 453)  
E.5 Chapter notes (pg. 456)  
E.6 Exercises (pg. 457)  
F. Notation (pg. 459)  
Bibliography (pg. 461)  
Index (pg. 475) 
Afshin Rostamizadeh
Afshin Rostamizadeh is a Research Scientist at Google Research.
Ameet Talwalkar
Ameet Talwalkar is a National Science Foundation Postdoctoral Fellow in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley.
Mehryar Mohri
Mehryar Mohri is Professor of Computer Science at New York University’s Courant Institute of Mathematical Sciences and a Research Consultant at Google Research.
Instructors  

You must have an instructor account and submit a request to access instructor materials for this book.

eTextbook
Go paperless today! Available online anytime, nothing to download or install.
Features
