Foundations of Machine Learning, Second Edition

by Mohri, Rostamizadeh, Talwalkar

| ISBN: 9780262351355 | Copyright 2020

Click here to preview

Instructor Requests

Digital Exam/Desk Copy Print Exam/Desk Copy Ancillaries
Tabs

A new edition of a graduate-level machine learning textbook that focuses on the analysis and theory of algorithms.

This book is a general introduction to machine learning that can serve as a textbook for graduate students and a reference for researchers. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics.

Foundations of Machine Learning is unique in its focus on the analysis and theory of algorithms. The first four chapters lay the theoretical foundation for what follows; subsequent chapters are mostly self-contained. Topics covered include the Probably Approximately Correct (PAC) learning framework; generalization bounds based on Rademacher complexity and VC-dimension; Support Vector Machines (SVM); kernel methods; boosting; on-line learning; multi-class classification; ranking; regression; algorithmic stability; dimensionality reduction; learning automata and languages; and reinforcement learning. Each chapter ends with a set of exercises. Appendixes aoffer dditional material including concise probability review.

This second edition offers three new chapters, on model selection, maximum entropy models, and conditional entropy models. New material in the appendixes includes a major section on Fenchel duality, expanded coverage of concentration inequalities, and an entirely new entry on information theory. More than half of the exercises are new to this edition.

Expand/Collapse All
Contents (pg. v)
Preface (pg. xiii)
1. Introduction (pg. 1)
1.1 What is machine learning? (pg. 1)
1.2 What kind of problems can be tackled using machine learning? (pg. 2)
1.3 Some standard learning tasks (pg. 3)
1.4 Learning stages (pg. 4)
1.5 Learning scenarios (pg. 6)
1.6 Generalization (pg. 7)
2. The PAC Learning Framework (pg. 9)
2.1 The PAC learning model (pg. 9)
2.2 Guarantees for finite hypothesis sets — consistent case (pg. 15)
2.3 Guarantees for finite hypothesis sets — inconsistent case&# (pg. 19)
2.4 Generalities (pg. 21)
2.4.1 Deterministic versus stochastic scenarios (pg. 21)
2.4.2 Bayes error and noise (pg. 22)
2.5 Chapter notes (pg. 23)
2.6 Exercises (pg. 23)
3. Rademacher Complexity and VC-Dimension (pg. 29)
3.1 Rademacher complexity (pg. 30)
3.2 Growth function (pg. 34)
3.3 VC-dimension (pg. 36)
3.4 Lower bounds (pg. 43)
3.5 Chapter notes (pg. 48)
3.6 Exercises (pg. 50)
4. Model Selection (pg. 61)
4.1 Estimation and approximation errors (pg. 61)
4.2 Empirical risk minimization (ERM) (pg. 62)
4.3 Structural risk minimization (SRM) (pg. 64)
4.4 Cross-validation (pg. 68)
4.5 n-Fold cross-validation (pg. 71)
4.6 Regularization-based algorithms (pg. 72)
4.7 Convex surrogate losses (pg. 73)
4.8 Chapter notes (pg. 77)
4.9 Exercises (pg. 78)
5. Support Vector Machines (pg. 79)
5.1 Linear classification (pg. 79)
5.2 Separable case (pg. 80)
5.2.1 Primal optimization problem (pg. 81)
5.2.2 Support vectors (pg. 83)
5.2.3 Dual optimization problem (pg. 83)
5.2.4 Leave-one-out analysis (pg. 85)
5.3 Non-separable case (pg. 87)
5.3.1 Primal optimization problem (pg. 88)
5.3.2 Support vectors (pg. 89)
5.3.3 Dual optimization problem (pg. 90)
5.4 Margin theory (pg. 91)
5.5 Chapter notes (pg. 100)
5.6 Exercises (pg. 100)
6. Kernel Methods (pg. 105)
6.1 Introduction (pg. 105)
6.2 Positive definite symmetric kernels (pg. 108)
6.2.1 Definitions (pg. 108)
6.2.2 Reproducing kernel Hilbert space (pg. 110)
6.2.3 Properties (pg. 112)
6.3 Kernel-based algorithms (pg. 116)
6.3.1 SVMs with PDS kernels (pg. 116)
6.3.2 Representer theorem (pg. 117)
6.3.3 Learning guarantees (pg. 117)
6.4 Negative definite symmetric kernels (pg. 119)
6.5 Sequence kernels (pg. 121)
6.5.1 Weighted transducers (pg. 122)
6.5.2 Rational kernels (pg. 126)
6.6 Approximate kernel feature maps (pg. 130)
6.7 Chapter notes (pg. 135)
6.8 Exercises (pg. 137)
7. Boosting (pg. 145)
7.1 Introduction (pg. 145)
7.2 AdaBoost (pg. 146)
7.2.1 Bound on the empirical error (pg. 149)
7.2.2 Relationship with coordinate descent (pg. 150)
7.2.3 Practical use (pg. 154)
7.3 Theoretical results (pg. 154)
7.3.1 VC-dimension-based analysis (pg. 154)
7.3.2 L1-geometric margin (pg. 155)
7.3.3 Margin-based analysis (pg. 157)
7.3.4 Margin maximization (pg. 161)
7.3.5 Game-theoretic interpretation (pg. 162)
7.4 L1-regularization (pg. 165)
7.5 Discussion (pg. 167)
7.6 Chapter notes (pg. 168)
7.7 Exercises (pg. 170)
8. On-Line Learning (pg. 177)
8.1 Introduction (pg. 178)
8.2 Prediction with expert advice (pg. 178)
8.2.1 Mistake bounds and Halving algorithm (pg. 179)
8.2.2 Weighted majority algorithm (pg. 181)
8.2.3 Randomized weighted majority algorithm (pg. 183)
8.2.4 Exponential weighted average algorithm (pg. 186)
8.3 Linear classification (pg. 190)
8.3.1 Perceptron algorithm (pg. 190)
8.3.2 Winnow algorithm (pg. 198)
8.4 On-line to batch conversion (pg. 201)
8.5 Game-theoretic connection (pg. 204)
8.6 Chapter notes (pg. 205)
8.7 Exercises (pg. 206)
9. Multi-Class Classification (pg. 213)
9.1 Multi-class classification problem (pg. 213)
9.2 Generalization bounds (pg. 215)
9.3 Uncombined multi-class algorithms (pg. 221)
9.3.1 Multi-class SVMs (pg. 221)
9.3.2 Multi-class boosting algorithms (pg. 222)
9.3.3 Decision trees (pg. 224)
9.4 Aggregated multi-class algorithms (pg. 228)
9.4.1 One-versus-all (pg. 229)
9.4.2 One-versus-one (pg. 229)
9.4.3 Error-correcting output codes (pg. 231)
9.5 Structured prediction algorithms (pg. 233)
9.6 Chapter notes (pg. 235)
9.7 Exercises (pg. 237)
10. Ranking (pg. 239)
10.1 The problem of ranking (pg. 240)
10.2 Generalization bound (pg. 241)
10.3 Ranking with SVMs (pg. 243)
10.4 RankBoost (pg. 244)
10.4.1 Bound on the empirical error (pg. 246)
10.4.2 Relationship with coordinate descent (pg. 248)
10.4.3 Margin bound for ensemble methods in ranking (pg. 250)
10.5 Bipartite ranking (pg. 251)
10.5.1 Boosting in bipartite ranking (pg. 252)
10.5.2 Area under the ROC curve (pg. 255)
10.6 Preference-based setting (pg. 257)
10.6.1 Second-stage ranking problem (pg. 257)
10.6.2 Deterministic algorithm (pg. 259)
10.6.3 Randomized algorithm (pg. 260)
10.6.4 Extension to other loss functions (pg. 262)
10.7 Other ranking criteria (pg. 262)
10.8 Chapter notes (pg. 263)
10.9 Exercises (pg. 264)
11. Regression (pg. 267)
11.1 The problem of regression (pg. 267)
11.2 Generalization bounds (pg. 268)
11.2.1 Finite hypothesis sets (pg. 268)
11.2.2 Rademacher complexity bounds (pg. 269)
11.2.3 Pseudo-dimension bounds (pg. 271)
11.3 Regression algorithms (pg. 275)
11.3.1 Linear regression (pg. 275)
11.3.2 Kernel ridge regression (pg. 276)
11.3.3 Support vector regression (pg. 281)
11.3.4 Lasso (pg. 285)
11.3.5 Group norm regression algorithms (pg. 289)
11.3.6 On-line regression algorithms (pg. 289)
11.4 Chapter notes (pg. 290)
11.5 Exercises (pg. 292)
12. Maximum Entropy Models (pg. 295)
12.1 Density estimation problem (pg. 295)
12.1.1 Maximum Likelihood (ML) solution (pg. 296)
12.1.2 Maximum a Posteriori (MAP) solution (pg. 297)
12.2 Density estimation problem augmented with features (pg. 297)
12.3 Maxent principle (pg. 298)
12.4 Maxent models (pg. 299)
12.5 Dual problem (pg. 299)
12.6 Generalization bound (pg. 303)
12.7 Coordinate descent algorithm (pg. 304)
12.8 Extensions (pg. 306)
12.9 L2-regularization (pg. 308)
12.10 Chapter notes (pg. 312)
12.11 Exercises (pg. 313)
13. Conditional Maximum Entropy Models (pg. 315)
13.1 Learning problem (pg. 315)
13.2 Conditional Maxent principle (pg. 316)
13.3 Conditional Maxent models (pg. 316)
13.4 Dual problem (pg. 317)
13.5 Properties (pg. 319)
13.5.1 Optimization problem (pg. 320)
13.5.2 Feature vectors (pg. 320)
13.5.3 Prediction (pg. 321)
13.6 Generalization bounds (pg. 321)
13.7 Logistic regression (pg. 325)
13.7.1 Optimization problem (pg. 325)
13.7.2 Logistic model (pg. 325)
13.8 L2-regularization (pg. 326)
13.9 Proof of the duality theorem (pg. 328)
13.10 Chapter notes (pg. 330)
13.11 Exercises (pg. 331)
14. Algorithmic Stability (pg. 333)
14.1 Definitions (pg. 333)
14.2 Stability-based generalization guarantee (pg. 334)
14.3 Stability of kernel-based regularization algorithms (pg. 336)
14.3.1 Application to regression algorithms: SVR and KRR (pg. 339)
14.3.2 Application to classification algorithms: SVMs (pg. 341)
14.3.3 Discussion (pg. 342)
14.4 Chapter notes (pg. 342)
14.5 Exercises (pg. 343)
15. Dimensionality Reduction (pg. 347)
15.1 Principal component analysis (pg. 348)
15.2 Kernel principal component analysis (KPCA) (pg. 349)
15.3 KPCA and manifold learning (pg. 351)
15.3.1 Isomap (pg. 351)
15.3.2 Laplacian eigenmaps (pg. 352)
15.3.3 Locally linear embedding (LLE) (pg. 353)
15.4 Johnson-Lindenstrauss lemma (pg. 354)
15.5 Chapter notes (pg. 356)
15.6 Exercises (pg. 356)
16. Learning Automata and Languages (pg. 359)
16.1 Introduction (pg. 359)
16.2 Finite automata (pg. 360)
16.3 Efficient exact learning (pg. 361)
16.3.1 Passive learning (pg. 362)
16.3.2 Learning with queries (pg. 363)
16.3.3 Learning automata with queries (pg. 364)
16.4 Identification in the limit (pg. 369)
16.4.1 Learning reversible automata (pg. 370)
16.5 Chapter notes (pg. 375)
16.6 Exercises (pg. 376)
17. Reinforcement Learning (pg. 379)
17.1 Learning scenario (pg. 379)
17.2 Markov decision process model (pg. 380)
17.3 Policy (pg. 381)
17.3.1 Definition (pg. 381)
17.3.2 Policy value (pg. 382)
17.3.3 Optimal policies (pg. 382)
17.3.4 Policy evaluation (pg. 385)
17.4 Planning algorithms (pg. 387)
17.4.1 Value iteration (pg. 387)
17.4.2 Policy iteration (pg. 390)
17.4.3 Linear programming (pg. 392)
17.5 Learning algorithms (pg. 393)
17.5.1 Stochastic approximation (pg. 394)
17.5.2 TD(0) algorithm (pg. 397)
17.5.3 Q-learning algorithm (pg. 398)
17.5.4 SARSA (pg. 402)
17.5.5 TD(λ) algorithm (pg. 402)
17.5.6 Large state space (pg. 403)
17.6 Chapter notes (pg. 405)
Conclusion (pg. 407)
A. Linear Algebra Review (pg. 409)
A.1 Vectors and norms (pg. 409)
A.1.1 Norms (pg. 409)
A.1.2 Dual norms (pg. 410)
A.1.3 Relationship between norms (pg. 411)
A.2 Matrices (pg. 411)
A.2.1 Matrix norms (pg. 411)
A.2.2 Singular value decomposition (pg. 412)
A.2.3 Symmetric positive semidefinite (SPSD) matrices (pg. 412)
B. Convex Optimization (pg. 415)
B.1 Differentiation and unconstrained optimization (pg. 415)
B.2 Convexity (pg. 415)
B.3 Constrained optimization (pg. 419)
B.4 Fenchel duality (pg. 422)
B.4.1 Subgradients (pg. 422)
B.4.2 Core (pg. 423)
B.4.3 Conjugate functions (pg. 423)
B.5 Chapter notes (pg. 426)
B.6 Exercises (pg. 427)
C. Probability Review (pg. 429)
C.1 Probability (pg. 429)
C.2 Random variables (pg. 429)
C.3 Conditional probability and independence (pg. 431)
C.4 Expectation and Markov’s inequality (pg. 431)
C.5 Variance and Chebyshev’s inequality (pg. 432)
C.6 Momentgenerating functions (pg. 434)
C.7 Exercises (pg. 435)
D. Concentration Inequalities (pg. 437)
D.1 Hoeffding’s inequality (pg. 437)
D.2 Sanov’s theorem (pg. 438)
D.3 Multiplicative Chernoff bounds (pg. 439)
D.4 Binomial distribution tails: Upper bounds (pg. 440)
D.5 Binomial distribution tails: Lower bound (pg. 440)
D.6 Azuma’s inequality (pg. 441)
D.7 McDiarmid’s inequality (pg. 442)
D.8 Normal distribution tails: Lower bound (pg. 443)
D.9 Khintchine-Kahane inequality (pg. 443)
D.10 Maximal inequality (pg. 444)
D.11 Chapter notes (pg. 445)
D.12 Exercises (pg. 445)
E. Notions of Information Theory (pg. 449)
E.1 Entropy (pg. 449)
E.2 Relative entropy (pg. 450)
E.3 Mutual information (pg. 453)
E.4 Bregman divergences (pg. 453)
E.5 Chapter notes (pg. 456)
E.6 Exercises (pg. 457)
F. Notation (pg. 459)
Bibliography (pg. 461)
Index (pg. 475)

Afshin Rostamizadeh

Afshin Rostamizadeh is a Research Scientist at Google Research.


Ameet Talwalkar

Ameet Talwalkar is a National Science Foundation Postdoctoral Fellow in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley.


Mehryar Mohri

Mehryar Mohri is Professor of Computer Science at New York University’s Courant Institute of Mathematical Sciences and a Research Consultant at Google Research.


Instructors
You must have an instructor account and submit a request to access instructor materials for this book.
eTextbook
Go paperless today! Available online anytime, nothing to download or install.

Features

  • Bookmarking
  • Note taking
  • Highlighting
Support