Introduction to Machine Learning, 4e
by Alpaydin
ISBN: 9780262043793 | Copyright 2020
A substantially revised fourth edition of a comprehensive textbook, including new coverage of recent advances in deep learning and neural networks.
The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Machine learning underlies such exciting new technologies as self-driving cars, speech recognition, and translation applications. This substantially revised fourth edition of a comprehensive, widely used machine learning textbook offers new coverage of recent advances in the field in both theory and practice, including developments in deep learning and neural networks.
The book covers a broad array of topics not usually included in introductory machine learning texts, including supervised learning, Bayesian decision theory, parametric methods, semiparametric methods, nonparametric methods, multivariate analysis, hidden Markov models, reinforcement learning, kernel machines, graphical models, Bayesian estimation, and statistical testing. The fourth edition offers a new chapter on deep learning that discusses training, regularizing, and structuring deep neural networks such as convolutional and generative adversarial networks; new material in the chapter on reinforcement learning that covers the use of deep networks, policy gradient methods, and deep reinforcement learning; new material in the chapter on multilayer perceptrons on autoencoders and the word2vec network; and a discussion of a popular method of dimensionality reduction, t-SNE. New appendixes offer background material on linear algebra and optimization. End-of-chapter exercises help readers apply the concepts they have learned. Introduction to Machine Learning can be used in courses for advanced undergraduate and graduate students and as a reference for professionals.
Contents (pg. vii)
Preface (pg. xix)
Notations (pg. xxiii)
1. Introduction (pg. 1)
1.1 What Is Machine Learning? (pg. 1)
1.2 Examples of Machine Learning Applications (pg. 4)
1.2.1 Association Rules (pg. 4)
1.2.2 Classification (pg. 4)
1.2.3 Regression (pg. 9)
1.2.4 Unsupervised Learning (pg. 11)
1.2.5 Reinforcement Learning (pg. 12)
1.3 History (pg. 13)
1.4 Related Topics (pg. 15)
1.4.1 High-Performance Computing (pg. 15)
1.4.2 Data Privacy and Security (pg. 16)
1.4.3 Model Interpretability and Trust (pg. 17)
1.4.4 Data Science (pg. 18)
1.5 Exercises (pg. 18)
1.6 References (pg. 20)
2. Supervised Learning (pg. 23)
2.1 Learning a Class from Examples (pg. 23)
2.2 Vapnik-Chervonenkis Dimension (pg. 29)
2.3 Probably Approximately Correct Learning (pg. 31)
2.4 Noise (pg. 32)
2.5 Learning Multiple Classes (pg. 34)
2.6 Regression (pg. 36)
2.7 Model Selection and Generalization (pg. 39)
2.8 Dimensions of a Supervised Machine Learning Algorithm (pg. 43)
2.9 Notes (pg. 44)
2.10 Exercises (pg. 45)
2.11 References (pg. 49)
3. Bayesian Decision Theory (pg. 51)
3.1 Introduction (pg. 51)
3.2 Classification (pg. 53)
3.3 Losses and Risks (pg. 55)
3.4 Discriminant Functions (pg. 57)
3.5 Association Rules (pg. 58)
3.6 Notes (pg. 61)
3.7 Exercises (pg. 62)
3.8 References (pg. 66)
4. Parametric Methods (pg. 67)
4.1 Introduction (pg. 67)
4.2 Maximum Likelihood Estimation (pg. 68)
4.2.1 Bernoulli Density (pg. 69)
4.2.2 Multinomial Density (pg. 70)
4.2.3 Gaussian (Normal) Density (pg. 70)
4.3 Evaluating an Estimator: Bias and Variance (pg. 71)
4.4 The Bayes’ Estimator (pg. 72)
4.5 Parametric Classification (pg. 75)
4.6 Regression (pg. 79)
4.7 Tuning Model Complexity: Bias/Variance Dilemma (pg. 82)
4.8 Model Selection Procedures (pg. 85)
4.9 Notes (pg. 89)
4.10 Exercises (pg. 90)
4.11 References (pg. 93)
5. Multivariate Methods (pg. 95)
5.1 Multivariate Data (pg. 95)
5.2 Parameter Estimation (pg. 96)
5.3 Estimation of Missing Values (pg. 97)
5.4 Multivariate Normal Distribution (pg. 98)
5.5 Multivariate Classification (pg. 102)
5.6 Tuning Complexity (pg. 108)
5.7 Discrete Features (pg. 110)
5.8 Multivariate Regression (pg. 111)
5.9 Notes (pg. 113)
5.10 Exercises (pg. 114)
5.11 References (pg. 116)
6. Dimensionality Reduction (pg. 117)
6.1 Introduction (pg. 117)
6.2 Subset Selection (pg. 118)
6.3 Principal Component Analysis (pg. 122)
6.4 Feature Embedding (pg. 129)
6.5 Factor Analysis (pg. 132)
6.6 Singular Value Decomposition and Matrix Factorization (pg. 137)
6.7 Multidimensional Scaling (pg. 138)
6.8 Linear Discriminant Analysis (pg. 142)
6.9 Canonical Correlation Analysis (pg. 147)
6.10 Isomap (pg. 150)
6.11 Locally Linear Embedding (pg. 152)
6.12 Laplacian Eigenmaps (pg. 154)
6.13 t-Distributed Stochastic Neighbor Embedding (pg. 157)
6.14 Notes (pg. 159)
6.15 Exercises (pg. 161)
6.16 References (pg. 162)
7. Clustering (pg. 165)
7.1 Introduction (pg. 165)
7.2 Mixture Densities (pg. 166)
7.3 k-Means Clustering (pg. 167)
7.4 Expectation-Maximization Algorithm (pg. 171)
7.5 Mixtures of Latent Variable Models (pg. 176)
7.6 Supervised Learning after Clustering (pg. 177)
7.7 Spectral Clustering (pg. 179)
7.8 Hierarchical Clustering (pg. 180)
7.9 Choosing the Number of Clusters (pg. 183)
7.10 Notes (pg. 183)
7.11 Exercises (pg. 184)
7.12 References (pg. 186)
8. Nonparametric Methods (pg. 189)
8.1 Introduction (pg. 189)
8.2 Nonparametric Density Estimation (pg. 190)
8.2.1 Histogram Estimator (pg. 191)
8.2.2 Kernel Estimator (pg. 192)
8.2.3 k-Nearest Neighbor Estimator (pg. 194)
8.3 Generalization to Multivariate Data (pg. 196)
8.4 Nonparametric Classification (pg. 197)
8.5 Condensed Nearest Neighbor (pg. 198)
8.6 Distance-Based Classification (pg. 200)
8.7 Outlier Detection (pg. 203)
8.8 Nonparametric Regression: Smoothing Models (pg. 205)
8.8.1 Running Mean Smoother (pg. 205)
8.8.2 Kernel Smoother (pg. 207)
8.8.3 Running Line Smoother (pg. 208)
8.9 How to Choose the Smoothing Parameter (pg. 208)
8.10 Notes (pg. 209)
8.11 Exercises (pg. 212)
8.12 References (pg. 214)
9. Decision Trees (pg. 217)
9.1 Introduction (pg. 217)
9.2 Univariate Trees (pg. 219)
9.2.1 Classification Trees (pg. 220)
9.2.2 Regression Trees (pg. 224)
9.3 Pruning (pg. 226)
9.4 Rule Extraction from Trees (pg. 229)
9.5 Learning Rules from Data (pg. 230)
9.6 Multivariate Trees (pg. 234)
9.7 Notes (pg. 236)
9.8 Exercises (pg. 239)
9.9 References (pg. 241)
10. Linear Discrimination (pg. 243)
10.1 Introduction (pg. 243)
10.2 Generalizing the Linear Model (pg. 245)
10.3 Geometry of the Linear Discriminant (pg. 246)
10.3.1 Two Classes (pg. 246)
10.3.2 Multiple Classes (pg. 248)
10.4 Pairwise Separation (pg. 250)
10.5 Parametric Discrimination Revisited (pg. 251)
10.6 Gradient Descent (pg. 252)
10.7 Logistic Discrimination (pg. 254)
10.7.1 Two Classes (pg. 254)
10.7.2 Multiple Classes (pg. 257)
10.7.3 Multiple Labels (pg. 263)
10.8 Learning to Rank (pg. 264)
10.9 Notes (pg. 265)
10.10 Exercises (pg. 267)
10.11 References (pg. 269)
11. Multilayer Perceptrons (pg. 271)
11.1 Introduction (pg. 271)
11.1.1 Understanding the Brain (pg. 272)
11.1.2 Neural Networks as a Paradigm for Parallel Processing (pg. 273)
11.2 The Perceptron (pg. 275)
11.3 Training a Perceptron (pg. 278)
11.4 Learning Boolean Functions (pg. 282)
11.5 Multilayer Perceptrons (pg. 283)
11.6 MLP as a Universal Approximator (pg. 286)
11.7 Backpropagation Algorithm (pg. 288)
11.7.1 Nonlinear Regression (pg. 288)
11.7.2 Two-Class Discrimination (pg. 291)
11.7.3 Multiclass Discrimination (pg. 292)
11.7.4 Multilabel Discrimination (pg. 294)
11.8 Overtraining (pg. 295)
11.9 Learning Hidden Representations (pg. 296)
11.10 Autoencoders (pg. 301)
11.11 Word2vec Architecture (pg. 303)
11.12 Notes (pg. 307)
11.13 Exercises (pg. 309)
11.14 References (pg. 310)
12. Deep Learning (pg. 313)
12.1 Introduction (pg. 313)
12.2 How to Train Multiple Hidden Layers (pg. 317)
12.2.1 Rectified Linear Unit (pg. 317)
12.2.2 Initialization (pg. 317)
12.2.3 Generalizing Backpropagation to Multiple Hidden Layers (pg. 318)
12.3 Improving Training Convergence (pg. 321)
12.3.1 Momentum (pg. 321)
12.3.2 Adaptive Learning Factor (pg. 322)
12.3.3 Batch Normalization (pg. 323)
12.4 Regularization (pg. 325)
12.4.1 Hints (pg. 325)
12.4.2 Weight Decay (pg. 327)
12.4.3 Dropout (pg. 330)
12.5 Convolutional Layers (pg. 331)
12.5.1 The Idea (pg. 331)
12.5.2 Formalization (pg. 333)
12.5.3 Examples: LeNet-5 and AlexNet (pg. 337)
12.5.4 Extensions (pg. 338)
12.5.5 Multimodal Deep Networks (pg. 340)
12.6 Tuning the Network Structure (pg. 340)
12.6.1 Structure and Hyperparameter Search (pg. 340)
12.6.2 Skip Connections (pg. 342)
12.6.3 Gating Units (pg. 343)
12.7 Learning Sequences (pg. 344)
12.7.1 Example Tasks (pg. 344)
12.7.2 Time-Delay Neural Networks (pg. 345)
12.7.3 Recurrent Networks (pg. 345)
12.7.4 Long Short-Term Memory Unit (pg. 348)
12.7.5 Gated Recurrent Unit (pg. 349)
12.8 Generative Adversarial Network (pg. 350)
12.9 Notes (pg. 353)
12.10 Exercises (pg. 354)
12.11 References (pg. 356)
13. Local Models (pg. 361)
13.1 Introduction (pg. 361)
13.2 Competitive Learning (pg. 362)
13.2.1 Online k-Means (pg. 362)
13.2.2 Adaptive Resonance Theory (pg. 367)
13.2.3 Self-Organizing Maps (pg. 368)
13.3 Radial Basis Functions (pg. 370)
13.4 Incorporating Rule-Based Knowledge (pg. 376)
13.5 Normalized Basis Functions (pg. 377)
13.6 Competitive Basis Functions (pg. 379)
13.7 Learning Vector Quantization (pg. 382)
13.8 The Mixture of Experts (pg. 382)
13.8.1 Cooperative Experts (pg. 385)
13.8.2 Competitive Experts (pg. 386)
13.9 Hierarchical Mixture of Experts and Soft Decision Trees (pg. 386)
13.10 Notes (pg. 388)
13.11 Exercises (pg. 389)
13.12 References (pg. 392)
14. Kernel Machines (pg. 395)
14.1 Introduction (pg. 395)
14.2 Optimal Separating Hyperplane (pg. 397)
14.3 The Nonseparable Case: Soft Margin Hyperplane (pg. 401)
14.4 ν-SVM (pg. 404)
14.5 Kernel Trick (pg. 405)
14.6 Vectorial Kernels (pg. 407)
14.7 Defining Kernels (pg. 410)
14.8 Multiple Kernel Learning (pg. 411)
14.9 Multiclass Kernel Machines (pg. 413)
14.10 Kernel Machines for Regression (pg. 414)
14.11 Kernel Machines for Ranking (pg. 419)
14.12 One-Class Kernel Machines (pg. 420)
14.13 Large Margin Nearest Neighbor Classifier (pg. 423)
14.14 Kernel Dimensionality Reduction (pg. 425)
14.15 Notes (pg. 426)
14.16 Exercises (pg. 428)
14.17 References (pg. 429)
15. Graphical Models (pg. 433)
15.1 Introduction (pg. 433)
15.2 Canonical Cases for Conditional Independence (pg. 435)
15.3 Generative Models (pg. 442)
15.4 d-Separation (pg. 445)
15.5 Belief Propagation (pg. 445)
15.5.1 Chains (pg. 446)
15.5.2 Trees (pg. 448)
15.5.3 Polytrees (pg. 450)
15.5.4 Junction Trees (pg. 452)
15.6 Undirected Graphs: Markov Random Fields (pg. 453)
15.7 Learning the Structure of a Graphical Model (pg. 456)
15.8 Influence Diagrams (pg. 457)
15.9 Notes (pg. 458)
15.10 Exercises (pg. 459)
15.11 References (pg. 461)
16. Hidden Markov Models (pg. 463)
16.1 Introduction (pg. 463)
16.2 Discrete Markov Processes (pg. 464)
16.3 Hidden Markov Models (pg. 467)
16.4 Three Basic Problems of HMMs (pg. 469)
16.5 Evaluation Problem (pg. 469)
16.6 Finding the State Sequence (pg. 473)
16.7 Learning Model Parameters (pg. 475)
16.8 Continuous Observations (pg. 478)
16.9 The HMM as a Graphical Model (pg. 479)
16.10 Model Selection in HMMs (pg. 482)
16.11 Notes (pg. 484)
16.12 Exercises (pg. 486)
16.13 References (pg. 489)
17. Bayesian Estimation (pg. 491)
17.1 Introduction (pg. 491)
17.2 Bayesian Estimation of the Parameters of a Discrete Distribution (pg. 495)
17.2.1 K > 2 States: Dirichlet Distribution (pg. 495)
17.2.2 K = 2 States: Beta Distribution (pg. 496)
17.3 Bayesian Estimation of the Parameters of a Gaussian Distribution (pg. 497)
17.3.1 Univariate Case: Unknown Mean, Known Variance (pg. 497)
17.3.2 Univariate Case: Unknown Mean, Unknown Variance (pg. 499)
17.3.3 Multivariate Case: Unknown Mean, Unknown Covariance (pg. 501)
17.4 Bayesian Estimation of the Parameters of a Function (pg. 502)
17.4.1 Regression (pg. 502)
17.4.2 Regression with Prior on Noise Precision (pg. 506)
17.4.3 The Use of Basis/Kernel Functions (pg. 507)
17.4.4 Bayesian Classification (pg. 509)
17.5 Choosing a Prior (pg. 512)
17.6 Bayesian Model Comparison (pg. 513)
17.7 Bayesian Estimation of a Mixture Model (pg. 516)
17.8 Nonparametric Bayesian Modeling (pg. 519)
17.9 Gaussian Processes (pg. 520)
17.10 Dirichlet Processes and Chinese Restaurants (pg. 524)
17.11 Latent Dirichlet Allocation (pg. 526)
17.12 Beta Processes and Indian Buffets (pg. 528)
17.13 Notes (pg. 529)
17.14 Exercises (pg. 530)
17.15 References (pg. 531)
18. Combining Multiple Learners (pg. 533)
18.1 Rationale (pg. 533)
18.2 Generating Diverse Learners (pg. 534)
18.3 Model Combination Schemes (pg. 537)
18.4 Voting (pg. 538)
18.5 Error-Correcting Output Codes (pg. 542)
18.6 Bagging (pg. 544)
18.7 Boosting (pg. 545)
18.8 The Mixture of Experts Revisited (pg. 548)
18.9 Stacked Generalization (pg. 550)
18.10 Fine-Tuning an Ensemble (pg. 551)
18.10.1 Choosing a Subset of the Ensemble (pg. 552)
18.10.2 Constructing Metalearners (pg. 552)
18.11 Cascading (pg. 553)
18.12 Notes (pg. 555)
18.13 Exercises (pg. 557)
18.14 References (pg. 559)
19. Reinforcement Learning (pg. 563)
19.1 Introduction (pg. 563)
19.2 Single State Case: K-Armed Bandit (pg. 565)
19.3 Elements of Reinforcement Learning (pg. 566)
19.4 Model-Based Learning (pg. 569)
19.4.1 Value Iteration (pg. 569)
19.4.2 Policy Iteration (pg. 570)
19.5 Temporal Difference Learning (pg. 571)
19.5.1 Exploration Strategies (pg. 571)
19.5.2 Deterministic Rewards and Actions (pg. 572)
19.5.3 Nondeterministic Rewards and Actions (pg. 573)
19.5.4 Eligibility Traces (pg. 576)
19.6 Generalization (pg. 577)
19.7 Partially Observable States (pg. 580)
19.7.1 The Setting (pg. 580)
19.7.2 Example: The Tiger Problem (pg. 582)
19.8 Deep Q Learning (pg. 587)
19.9 Policy Gradients (pg. 588)
19.10 Learning to Play Backgammon and Go (pg. 591)
19.11 Notes (pg. 592)
19.12 Exercises (pg. 593)
19.13 References (pg. 595)
20. Design and Analysis of Machine Learning Experiments (pg. 597)
20.1 Introduction (pg. 597)
20.2 Factors, Response, and Strategy of Experimentation (pg. 600)
20.3 Response Surface Design (pg. 603)
20.4 Randomization, Replication, and Blocking (pg. 604)
20.5 Guidelines for Machine Learning Experiments (pg. 605)
20.6 Cross-Validation and Resampling Methods (pg. 608)
20.6.1 K-Fold Cross-Validation (pg. 609)
20.6.2 5 × 2 Cross-Validation (pg. 610)
20.6.3 Bootstrapping (pg. 611)
20.7 Measuring Classifier Performance (pg. 611)
20.8 Interval Estimation (pg. 614)
20.9 Hypothesis Testing (pg. 618)
20.10 Assessing a Classification Algorithm’s Performance (pg. 620)
20.10.1 Binomial Test (pg. 621)
20.10.2 Approximate Normal Test (pg. 622)
20.10.3 t Test (pg. 622)
20.11 Comparing Two Classification Algorithms (pg. 623)
20.11.1 McNemar’s Test (pg. 623)
20.11.2 K-Fold Cross-Validated Paired t Test (pg. 623)
20.11.3 5 × 2 cv Paired t Test (pg. 624)
20.11.4 5 × 2 cv Paired F Test (pg. 625)
20.12 Comparing Multiple Algorithms: Analysis of Variance (pg. 626)
20.13 Comparison over Multiple Datasets (pg. 630)
20.13.1 Comparing Two Algorithms (pg. 631)
20.13.2 Multiple Algorithms (pg. 633)
20.14 Multivariate Tests (pg. 634)
20.14.1 Comparing Two Algorithms (pg. 635)
20.14.2 Comparing Multiple Algorithms (pg. 636)
20.15 Notes (pg. 637)
20.16 Exercises (pg. 638)
20.17 References (pg. 640)
A. Probability (pg. 643)
A.1 Elements of Probability (pg. 643)
A.1.1 Axioms of Probability (pg. 644)
A.1.2 Conditional Probability (pg. 644)
A.2 Random Variables (pg. 645)
A.2.1 Probability Distribution and Density Functions (pg. 645)
A.2.2 Joint Distribution and Density Functions (pg. 646)
A.2.3 Conditional Distributions (pg. 646)
A.2.4 Bayes’ Rule (pg. 647)
A.2.5 Expectation (pg. 647)
A.2.6 Variance (pg. 648)
A.2.7 Weak Law of Large Numbers (pg. 649)
A.3 Special Random Variables (pg. 649)
A.3.1 Bernoulli Distribution (pg. 649)
A.3.2 Binomial Distribution (pg. 650)
A.3.3 Multinomial Distribution (pg. 650)
A.3.4 Uniform Distribution (pg. 650)
A.3.5 Normal (Gaussian) Distribution (pg. 651)
A.3.6 Chi-Square Distribution (pg. 652)
A.3.7 t Distribution (pg. 653)
A.3.8 F Distribution (pg. 653)
A.4 References (pg. 653)
B. Linear Algebra (pg. 655)
B.1 Vectors (pg. 655)
B.2 Matrices (pg. 657)
B.3 Similarity of Vectors (pg. 658)
B.4 Square Matrices (pg. 659)
B.5 Linear Dependence and Ranks (pg. 659)
B.6 Inverses (pg. 660)
B.7 Positive Definite Matrices (pg. 660)
B.8 Trace and Determinant (pg. 660)
B.9 Eigenvalues and Eigenvectors (pg. 661)
B.10 Spectral Decomposition (pg. 662)
B.11 Singular Value Decomposition (pg. 662)
B.12 References (pg. 663)
C. Optimization (pg. 665)
C.1 Introduction (pg. 665)
C.2 Linear Optimization (pg. 667)
C.3 Convex Optimization (pg. 667)
C.4 Duality (pg. 668)
C.5 Local Optimization (pg. 670)
C.6 References (pg. 671)
Index (pg. 673)
Ethem Alpaydin
Ethem Alpaydin is Professor in the Department of Computer Engineering at Özyegin University and Member of The Science Academy, Istanbul. He is the author of Machine Learning: The New AI, a volume in the MIT Press Essential Knowledge series.