Deep Learning
by Goodfellow, Bengio, Courville
ISBN: 9780262364102  Copyright 2016
Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning.
The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models.
Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.
“Written by three experts in the field, Deep Learning is the only comprehensive book on the subject. It provides much-needed broad perspective and mathematical preliminaries for software engineers and students entering the field, and serves as a reference for authorities.”
—Elon Musk, cochair of OpenAI; cofounder and CEO of Tesla and SpaceX
“This is the definitive textbook on deep learning. Written by major contributors to the field, it is clear, comprehensive, and authoritative. If you want to know where deep learning came from, what it is good for, and where it is going, read this book.”
—Geoffrey Hinton FRS, Emeritus Professor, University of Toronto; Distinguished Research Scientist, Google
“Deep learning has taken the world of technology by storm since the beginning of the decade. There was a need for a textbook for students, practitioners, and instructors that includes basic concepts, practical aspects, and advanced research topics. This is the first comprehensive textbook on the subject, written by some of the most innovative and prolific researchers in the field. This will be a reference for years to come.”
—Yann LeCun, Director of AI Research, Facebook; Silver Professor of Computer Science, Data Science, and Neuroscience, New York University

Contents (pg. v)  
Website (pg. xiii)  
Acknowledgments (pg. xv)  
Notation (pg. xix)  
Introduction (pg. 1)  
Who Should Read This Book? (pg. 8)  
Historical Trends in Deep Learning (pg. 12)  
Applied Math and Machine Learning Basics (pg. 27)  
Linear Algebra (pg. 29)  
Scalars, Vectors, Matrices and Tensors (pg. 29)  
Multiplying Matrices and Vectors (pg. 32)  
Identity and Inverse Matrices (pg. 34)  
Linear Dependence and Span (pg. 35)  
Norms (pg. 36)  
Special Kinds of Matrices and Vectors (pg. 38)  
Eigendecomposition (pg. 39)  
Singular Value Decomposition (pg. 42)  
The Moore-Penrose Pseudoinverse (pg. 43)  
The Trace Operator (pg. 44)  
The Determinant (pg. 45)  
Example: Principal Components Analysis (pg. 45)  
Probability and Information Theory (pg. 51)  
Why Probability? (pg. 52)  
Random Variables (pg. 54)  
Probability Distributions (pg. 54)  
Marginal Probability (pg. 56)  
Conditional Probability (pg. 57)  
The Chain Rule of Conditional Probabilities (pg. 57)  
Independence and Conditional Independence (pg. 58)  
Expectation, Variance and Covariance (pg. 58)  
Common Probability Distributions (pg. 60)  
Useful Properties of Common Functions (pg. 65)  
Bayes’ Rule (pg. 68)  
Technical Details of Continuous Variables (pg. 68)  
Information Theory (pg. 70)  
Structured Probabilistic Models (pg. 74)  
Numerical Computation (pg. 77)  
Overflow and Underflow (pg. 77)  
Poor Conditioning (pg. 79)  
Gradient-Based Optimization (pg. 79)  
Constrained Optimization (pg. 89)  
Example: Linear Least Squares (pg. 92)  
Machine Learning Basics (pg. 95)  
Learning Algorithms (pg. 96)  
Capacity, Overfitting and Underfitting (pg. 107)  
Hyperparameters and Validation Sets (pg. 117)  
Estimators, Bias and Variance (pg. 119)  
Maximum Likelihood Estimation (pg. 128)  
Bayesian Statistics (pg. 132)  
Supervised Learning Algorithms (pg. 136)  
Unsupervised Learning Algorithms (pg. 142)  
Stochastic Gradient Descent (pg. 147)  
Building a Machine Learning Algorithm (pg. 149)  
Challenges Motivating Deep Learning (pg. 151)  
Deep Networks: Modern Practices (pg. 161)  
Deep Feedforward Networks (pg. 163)  
Example: Learning XOR (pg. 166)  
Gradient-Based Learning (pg. 171)  
Hidden Units (pg. 185)  
Architecture Design (pg. 191)  
Back-Propagation and Other Differentiation Algorithms (pg. 197)  
Historical Notes (pg. 217)  
Regularization for Deep Learning (pg. 221)  
Parameter Norm Penalties (pg. 223)  
Norm Penalties as Constrained Optimization (pg. 230)  
Regularization and Under-Constrained Problems (pg. 232)  
Dataset Augmentation (pg. 233)  
Noise Robustness (pg. 235)  
Semi-Supervised Learning (pg. 236)  
Multitask Learning (pg. 237)  
Early Stopping (pg. 239)  
Parameter Tying and Parameter Sharing (pg. 246)  
Sparse Representations (pg. 247)  
Bagging and Other Ensemble Methods (pg. 249)  
Dropout (pg. 251)  
Adversarial Training (pg. 261)  
Tangent Distance, Tangent Prop and Manifold Tangent Classifier (pg. 263)  
Optimization for Training Deep Models (pg. 267)  
How Learning Differs from Pure Optimization (pg. 268)  
Challenges in Neural Network Optimization (pg. 275)  
Basic Algorithms (pg. 286)  
Parameter Initialization Strategies (pg. 292)  
Algorithms with Adaptive Learning Rates (pg. 298)  
Approximate Second-Order Methods (pg. 302)  
Optimization Strategies and Meta-Algorithms (pg. 309)  
Convolutional Networks (pg. 321)  
The Convolution Operation (pg. 322)  
Motivation (pg. 324)  
Pooling (pg. 330)  
Convolution and Pooling as an Infinitely Strong Prior (pg. 334)  
Variants of the Basic Convolution Function (pg. 337)  
Structured Outputs (pg. 347)  
Data Types (pg. 348)  
Efficient Convolution Algorithms (pg. 350)  
Random or Unsupervised Features (pg. 351)  
The Neuroscientific Basis for Convolutional Networks (pg. 353)  
Convolutional Networks and the History of Deep Learning (pg. 359)  
Sequence Modeling: Recurrent and Recursive Nets (pg. 363)  
Unfolding Computational Graphs (pg. 365)  
Recurrent Neural Networks (pg. 368)  
Bidirectional RNNs (pg. 383)  
Encoder-Decoder Sequence-to-Sequence Architectures (pg. 385)  
Deep Recurrent Networks (pg. 387)  
Recursive Neural Networks (pg. 388)  
The Challenge of Long-Term Dependencies (pg. 390)  
Echo State Networks (pg. 392)  
Leaky Units and Other Strategies for Multiple Time Scales (pg. 395)  
The Long Short-Term Memory and Other Gated RNNs (pg. 397)  
Optimization for Long-Term Dependencies (pg. 401)  
Explicit Memory (pg. 405)  
Practical Methodology (pg. 409)  
Performance Metrics (pg. 410)  
Default Baseline Models (pg. 413)  
Determining Whether to Gather More Data (pg. 414)  
Selecting Hyperparameters (pg. 415)  
Debugging Strategies (pg. 424)  
Example: Multi-Digit Number Recognition (pg. 428)  
Applications (pg. 431)  
Large-Scale Deep Learning (pg. 431)  
Computer Vision (pg. 440)  
Speech Recognition (pg. 446)  
Natural Language Processing (pg. 448)  
Other Applications (pg. 465)  
Deep Learning Research (pg. 475)  
Linear Factor Models (pg. 479)  
Probabilistic PCA and Factor Analysis (pg. 480)  
Independent Component Analysis (ICA) (pg. 481)  
Slow Feature Analysis (pg. 484)  
Sparse Coding (pg. 486)  
Manifold Interpretation of PCA (pg. 489)  
Autoencoders (pg. 493)  
Undercomplete Autoencoders (pg. 494)  
Regularized Autoencoders (pg. 495)  
Representational Power, Layer Size and Depth (pg. 499)  
Stochastic Encoders and Decoders (pg. 500)  
Denoising Autoencoders (pg. 501)  
Learning Manifolds with Autoencoders (pg. 506)  
Contractive Autoencoders (pg. 510)  
Predictive Sparse Decomposition (pg. 514)  
Applications of Autoencoders (pg. 515)  
Representation Learning (pg. 517)  
Greedy Layer-Wise Unsupervised Pretraining (pg. 519)  
Transfer Learning and Domain Adaptation (pg. 526)  
Semi-Supervised Disentangling of Causal Factors (pg. 532)  
Distributed Representation (pg. 536)  
Exponential Gains from Depth (pg. 543)  
Providing Clues to Discover Underlying Causes (pg. 544)  
Structured Probabilistic Models for Deep Learning (pg. 549)  
The Challenge of Unstructured Modeling (pg. 550)  
Using Graphs to Describe Model Structure (pg. 554)  
Sampling from Graphical Models (pg. 570)  
Advantages of Structured Modeling (pg. 572)  
Learning about Dependencies (pg. 572)  
Inference and Approximate Inference (pg. 573)  
The Deep Learning Approach to Structured Probabilistic Models (pg. 575)  
Monte Carlo Methods (pg. 581)  
Sampling and Monte Carlo Methods (pg. 581)  
Importance Sampling (pg. 583)  
Markov Chain Monte Carlo Methods (pg. 586)  
Gibbs Sampling (pg. 590)  
The Challenge of Mixing between Separated Modes (pg. 591)  
Confronting the Partition Function (pg. 597)  
The Log-Likelihood Gradient (pg. 598)  
Stochastic Maximum Likelihood and Contrastive Divergence (pg. 599)  
Pseudolikelihood (pg. 607)  
Score Matching and Ratio Matching (pg. 609)  
Denoising Score Matching (pg. 611)  
Noise-Contrastive Estimation (pg. 612)  
Estimating the Partition Function (pg. 614)  
Approximate Inference (pg. 623)  
Inference as Optimization (pg. 624)  
Expectation Maximization (pg. 626)  
MAP Inference and Sparse Coding (pg. 627)  
Variational Inference and Learning (pg. 629)  
Learned Approximate Inference (pg. 642)  
Deep Generative Models (pg. 645)  
Boltzmann Machines (pg. 645)  
Restricted Boltzmann Machines (pg. 647)  
Deep Belief Networks (pg. 651)  
Deep Boltzmann Machines (pg. 654)  
Boltzmann Machines for Real-Valued Data (pg. 667)  
Convolutional Boltzmann Machines (pg. 673)  
Boltzmann Machines for Structured or Sequential Outputs (pg. 675)  
Other Boltzmann Machines (pg. 677)  
Back-Propagation through Random Operations (pg. 678)  
Directed Generative Nets (pg. 682)  
Drawing Samples from Autoencoders (pg. 701)  
Generative Stochastic Networks (pg. 704)  
Other Generation Schemes (pg. 706)  
Evaluating Generative Models (pg. 707)  
Conclusion (pg. 710)  
Bibliography (pg. 711)  
Index (pg. 767) 
Ian Goodfellow
Ian Goodfellow is Research Scientist at OpenAI.
Yoshua Bengio
Yoshua Bengio is Professor of Computer Science at the Université de Montréal.
Aaron Courville
Aaron Courville is Assistant Professor of Computer Science at the Université de Montréal.
