Veridical Data Science

The Practice of Responsible Data Analysis and Decision Making

by Bin Yu and Rebecca L. Barter

ISBN: 9780262049191 | Copyright 2024

Using real-world data case studies, this innovative and accessible textbook introduces an actionable framework for conducting trustworthy data science.

Most textbooks present data science as a linear analytic process involving a set of statistical and computational techniques without accounting for the challenges intrinsic to real-world applications. Veridical Data Science, by contrast, embraces the reality that most projects begin with an ambiguous domain question and messy data; it acknowledges that datasets are mere approximations of reality while analyses are mental constructs.
Bin Yu and Rebecca Barter employ the innovative Predictability, Computability, and Stability (PCS) framework to assess the trustworthiness and relevance of data-driven results relative to three sources of uncertainty that arise throughout the data science life cycle: the human decisions and judgment calls made during data collection, data cleaning, and modeling. By providing real-world data case studies, intuitive explanations of common statistical and machine learning techniques, and supplementary R and Python code, Veridical Data Science offers a clear and actionable guide for conducting responsible data science. Requiring little background knowledge, this lucid, self-contained textbook provides a solid foundation and principled framework for future study of advanced methods in machine learning, statistics, and data science.

  • Presents the Predictability, Computability, and Stability (PCS) methodology for producing trustworthy data-driven results
  • Teaches how a data science project should be conducted from beginning to end, including extensive discussion of the data scientist's decision-making process
  • Cultivates critical thinking throughout the entire data science life cycle
  • Provides practical examples and illuminating case studies of real-world data analysis problems with associated code, exercises, and solutions
  • Suitable for advanced undergraduate and graduate students, domain scientists, and practitioners

Contents (pg. vii)
Preface (pg. xv)
0.1. What Is Veridical Data Science? (pg. xix)
0.2. The Structure of This Book (pg. xxi)
Acknowledgments (pg. xxvii)
I. An Introduction to Veridical Data Science (pg. 1)
1. An Introduction to Veridical Data Science (pg. 3)
1.1 The Role of Data and Algorithms in Real-World Decision Making (pg. 3)
1.2 Evaluating and Building Trustworthiness Using Critical Thinking (pg. 6)
1.3 Evaluating and Building Trustworthiness Using the PCS Framework (pg. 11)
2. The Data Science Life Cycle (pg. 23)
2.1 Data Terminology (pg. 24)
2.2 DSLC Stage 1: Problem Formulation and Data Collection (pg. 26)
2.3 DSLC Stage 2: Data Cleaning and Exploratory Data Analysis (pg. 30)
2.4 DSLC Stage 3: Uncovering Intrinsic Data Structures (pg. 33)
2.5 DSLC Stage 4: Predictive and/or Inferential Analysis (pg. 34)
2.6 DSLC Stage 5: Evaluation of Results (pg. 36)
2.7 DSLC Stage 6: Communication of Results and Updating Domain Knowledge (pg. 37)
Exercises (pg. 38)
3. Setting Up Your Data Science Project (pg. 41)
3.1 Programming Languages and IDEs (pg. 41)
3.2 A Consistent Project Structure (pg. 45)
3.3 Reproducibility (pg. 51)
3.4 Tools for Collaboration (pg. 56)
Exercises (pg. 58)
II. Preparing, Exploring, and Describing Data (pg. 63)
4. Data Preparation (pg. 65)
4.1 The Organ Donation Data (pg. 70)
4.2 A Generalizable Data Cleaning Procedure (pg. 71)
4.3 Step 1: Learn About the Data Collection Process and the Problem Domain (pg. 74)
4.4 Step 2: Load the Data (pg. 77)
4.5 Step 3: Examine the Data and Create Action Items (pg. 78)
4.6 Step 4: Clean the Data (pg. 98)
4.7 Additional Common Preprocessing Steps (pg. 101)
Exercises (pg. 101)
5. Exploratory Data Analysis (pg. 107)
5.1 A Question-and-Answer-Based Exploratory Data Analysis Workflow (pg. 109)
5.2 Common Explorations (pg. 119)
5.3 Comparability (pg. 131)
5.4 PCS Scrutinization of Exploratory Results (pg. 133)
Exercises (pg. 139)
6. Principal Component Analysis (pg. 147)
6.1 The Nutrition Project (pg. 149)
6.2 Generating Summary Variables: Principal Component Analysis (pg. 156)
6.3 Preprocessing: Standardization for Comparability (pg. 162)
6.4 Singular Value Decomposition (pg. 164)
6.5 Preprocessing: Gaussianity and Transformations (pg. 173)
6.6 Principal Component Analysis Step-by-Step Summary (pg. 178)
6.7 PCS Evaluation of Principal Component Analysis (pg. 179)
6.8 Applying Principal Component Analysis to Each Nutrient Group (pg. 186)
6.9 Alternatives to Principal Component Analysis (pg. 189)
Exercises (pg. 189)
7. Clustering (pg. 195)
7.1 Understanding Clusters (pg. 197)
7.2 Hierarchical Clustering (pg. 204)
7.3 K-Means Clustering (pg. 211)
7.4 Visualizing Clusters in High Dimensions (pg. 215)
7.5 Quantitative Measures of Cluster Quality (pg. 220)
7.6 The Rand Index for Comparing Cluster Similarity (pg. 228)
7.7 Choosing the Number of Clusters (pg. 232)
7.8 PCS Scrutinization of Cluster Results (pg. 238)
7.9 The Final Clusters (pg. 243)
Exercises (pg. 246)
III. Prediction (pg. 251)
8. An Introduction to Prediction Problems (pg. 253)
8.1 Connecting the Past, Present, and Future for Prediction Problems (pg. 255)
8.2 Setting up a Prediction Problem (pg. 258)
8.3 PCS and Evaluating Prediction Algorithms (pg. 261)
8.4 The Ames House Price Prediction Project (pg. 263)
Exercises (pg. 270)
9. Continuous Responses and Least Squares (pg. 273)
9.1 Visualizing Predictive Relationships (pg. 273)
9.2 Using Fitted Lines to Generate Predictions (pg. 276)
9.3 Computing Fitted Lines (pg. 277)
9.4 Quantitative Measures of Predictive Performance (pg. 289)
9.5 PCS Scrutinization of Prediction Results (pg. 297)
Exercises (pg. 302)
10. Extending the Least Squares Algorithm (pg. 307)
10.1 Linear Fits with Multiple Predictive Features (pg. 307)
10.2 Pre-processing: One-Hot-Encoding (pg. 312)
10.3 Pre-processing: Variable Transformations (pg. 316)
10.4 Feature Selection (pg. 319)
10.5 Regularization (pg. 321)
10.6 PCS Evaluations (pg. 333)
10.7 Appendix: Matrix Formulation of a Linear Fit (pg. 343)
Exercises (pg. 344)
11. Binary Responses and Logistic Regression (pg. 349)
11.1 The Online Shopping Purchase Prediction Project (pg. 349)
11.2 Least Squares for Binary Prediction (pg. 357)
11.3 Logistic Regression (pg. 358)
11.4 Quantitative Measures of Binary Predictive Performance (pg. 369)
11.5 PCS Scrutinization of Binary Prediction Results (pg. 382)
Exercises (pg. 392)
12. Decision Trees and the Random Forest Algorithm (pg. 397)
12.1 Decision Trees (pg. 397)
12.2 The Classification and Regression Trees Algorithm (pg. 400)
12.3 The Random Forest Algorithm (pg. 410)
12.4 Random Forest Feature Importance Measures (pg. 413)
12.5 PCS Evaluation of the CART and RF Algorithms (pg. 418)
Exercises (pg. 425)
13. Producing the Final Prediction Results (pg. 429)
13.1 Approach 1: Choosing a Single Predictive Fit Using PCS (pg. 432)
13.2 Approach 2: PCS Ensemble (pg. 441)
13.3 Approach 3: Calibrated PCS Prediction Perturbation Intervals (pg. 447)
13.4 Choosing the Final Prediction Approach (pg. 456)
13.5 Using Your Predictions in the Real World (pg. 457)
Exercises (pg. 457)
14. Conclusion (pg. 463)
14.1 Predictability (pg. 464)
14.2 Stability and Uncertainty (pg. 465)
14.3 Future PCS Directions: Inference (pg. 468)
14.4 Farewell (pg. 469)
Answers to True or False Exercises (pg. 471)
Chapter 1 (pg. 471)
Chapter 2 (pg. 471)
Chapter 3 (pg. 472)
Chapter 4 (pg. 473)
Chapter 5 (pg. 474)
Chapter 6 (pg. 475)
Chapter 7 (pg. 475)
Chapter 8 (pg. 476)
Chapter 9 (pg. 477)
Chapter 10 (pg. 478)
Chapter 11 (pg. 479)
Chapter 12 (pg. 480)
Chapter 13 (pg. 480)
References (pg. 483)
Index (pg. 489)

Bin Yu

Bin Yu is Chancellor's Distinguished Professor and Class of 1936 Second Chair in Statistics, EECS, and Computational Biology at the University of California, Berkeley, a 2006 Guggenheim Fellow, and a member of the US National Academy of Sciences and the American Academy of Arts and Sciences.

Rebecca L. Barter

Rebecca L. Barter is Research Assistant Professor in Epidemiology at the University of Utah.
