|
Cover |
1 |
|
|
Title Page |
5 |
|
|
Copyright |
6 |
|
|
Contents |
9 |
|
|
Foreword by Gareth James |
21 |
|
|
Foreword by Ravi Bapna |
23 |
|
|
Preface to the R Edition |
25 |
|
|
Acknowledgments |
29 |
|
|
PART I PRELIMINARIES |
33 |
|
|
CHAPTER 1 Introduction |
35 |
|
|
1.1 What Is Business Analytics? |
35 |
|
|
1.2 What Is Data Mining? |
37 |
|
|
1.3 Data Mining and Related Terms |
37 |
|
|
1.4 Big Data |
38 |
|
|
1.5 Data Science |
39 |
|
|
1.6 Why Are There So Many Different Methods? |
40 |
|
|
1.7 Terminology and Notation |
41 |
|
|
1.8 Road Maps to This Book |
43 |
|
|
Order of Topics |
43 |
|
|
CHAPTER 2 Overview of the Data Mining Process |
47 |
|
|
2.1 Introduction |
47 |
|
|
2.2 Core Ideas in Data Mining |
48 |
|
|
Classification |
48 |
|
|
Prediction |
48 |
|
|
Association Rules and Recommendation Systems |
48 |
|
|
Predictive Analytics |
49 |
|
|
Data Reduction and Dimension Reduction |
49 |
|
|
Data Exploration and Visualization |
49 |
|
|
Supervised and Unsupervised Learning |
50 |
|
|
2.3 The Steps in Data Mining |
51 |
|
|
2.4 Preliminary Steps |
53 |
|
|
Organization of Datasets |
53 |
|
|
Predicting Home Values in the West Roxbury Neighborhood |
53 |
|
|
Loading and Looking at the Data in R |
|
|
|
Sampling from a Database |
56 |
|
|
Oversampling Rare Events in Classification Tasks |
57 |
|
|
Preprocessing and Cleaning the Data |
|
|
|
2.5 Predictive Power and Overfitting |
65 |
|
|
Overfitting |
65 |
|
|
Creation and Use of Data Partitions |
67 |
|
|
2.6 Building a Predictive Model |
70 |
|
|
Modeling Process |
71 |
|
|
2.7 Using R for Data Mining on a Local Machine |
75 |
|
|
2.8 Automating Data Mining Solutions |
75 |
|
|
Data Mining Software: The State of the Market (by Herb Edelstein) |
77 |
|
|
Problems |
81 |
|
|
PART II DATA EXPLORATION AND DIMENSION REDUCTION |
85 |
|
|
CHAPTER 3 Data Visualization |
87 |
|
|
3.1 Uses of Data Visualization |
87 |
|
|
Base R or ggplot? |
89 |
|
|
3.2 Data Examples |
89 |
|
|
Example 1: Boston Housing Data |
89 |
|
|
Example 2: Ridership on Amtrak Trains |
91 |
|
|
3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots |
91 |
|
|
Distribution Plots: Boxplots and Histograms |
93 |
|
|
Heatmaps: Visualizing Correlations and Missing Values |
96 |
|
|
3.4 Multidimensional Visualization |
99 |
|
|
Adding Variables: Color, Size, Shape, Multiple Panels, and Animation |
99 |
|
|
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering |
102 |
|
|
Reference: Trend Lines and Labels |
106 |
|
|
Scaling up to Large Datasets |
106 |
|
|
Multivariate Plot: Parallel Coordinates Plot |
107 |
|
|
Interactive Visualization |
109 |
|
|
3.5 Specialized Visualizations |
112 |
|
|
Visualizing Networked Data |
112 |
|
|
Visualizing Hierarchical Data: Treemaps |
114 |
|
|
Visualizing Geographical Data: Map Charts |
115 |
|
|
3.6 Summary: Major Visualizations and Operations, by Data Mining Goal |
118 |
|
|
Prediction |
118 |
|
|
Classification |
118 |
|
|
Time Series Forecasting |
118 |
|
|
Unsupervised Learning |
119 |
|
|
Problems |
120 |
|
|
CHAPTER 4 Dimension Reduction |
123 |
|
|
4.1 Introduction |
123 |
|
|
4.2 Curse of Dimensionality |
124 |
|
|
4.3 Practical Considerations |
124 |
|
|
Example 1: House Prices in Boston |
125 |
|
|
4.4 Data Summaries |
126 |
|
|
Summary Statistics |
126 |
|
|
Aggregation and Pivot Tables |
128 |
|
|
4.5 Correlation Analysis |
129 |
|
|
4.6 Reducing the Number of Categories in Categorical Variables |
131 |
|
|
4.7 Converting a Categorical Variable to a Numerical Variable |
131 |
|
|
4.8 Principal Components Analysis |
133 |
|
|
Example 2: Breakfast Cereals |
133 |
|
|
Principal Components |
138 |
|
|
Normalizing the Data |
139 |
|
|
Using Principal Components for Classification and Prediction |
141 |
|
|
4.9 Dimension Reduction Using Regression Models |
143 |
|
|
4.10 Dimension Reduction Using Classification and Regression Trees |
143 |
|
|
Problems |
144 |
|
|
PART III PERFORMANCE EVALUATION |
147 |
|
|
CHAPTER 5 Evaluating Predictive Performance |
149 |
|
|
5.1 Introduction |
149 |
|
|
5.2 Evaluating Predictive Performance |
150 |
|
|
Naive Benchmark: The Average |
150 |
|
|
Prediction Accuracy Measures |
151 |
|
|
Comparing Training and Validation Performance |
153 |
|
|
Lift Chart |
153 |
|
|
5.3 Judging Classifier Performance |
154 |
|
|
Benchmark: The Naive Rule |
156 |
|
|
Class Separation |
156 |
|
|
The Confusion (Classification) Matrix |
156 |
|
|
Using the Validation Data |
158 |
|
|
Accuracy Measures |
158 |
|
|
Propensities and Cutoff for Classification |
159 |
|
|
Performance in Case of Unequal Importance of Classes |
163 |
|
|
Asymmetric Misclassification Costs |
165 |
|
|
Generalization to More Than Two Classes |
167 |
|
|
5.4 Judging Ranking Performance |
168 |
|
|
Lift Charts for Binary Data |
168 |
|
|
Decile Lift Charts |
170 |
|
|
Beyond Two Classes |
171 |
|
|
Lift Charts Incorporating Costs and Benefits |
171 |
|
|
Lift as a Function of Cutoff |
172 |
|
|
5.5 Oversampling |
172 |
|
|
Oversampling the Training Set |
176 |
|
|
Evaluating Model Performance Using a Non-oversampled Validation Set |
176 |
|
|
Evaluating Model Performance if Only Oversampled Validation Set Exists |
176 |
|
|
Problems |
179 |
|
|
PART IV PREDICTION AND CLASSIFICATION METHODS |
183 |
|
|
CHAPTER 6 Multiple Linear Regression |
185 |
|
|
6.1 Introduction |
185 |
|
|
6.2 Explanatory vs. Predictive Modeling |
186 |
|
|
6.3 Estimating the Regression Equation and Prediction |
188 |
|
|
Example: Predicting the Price of Used Toyota Corolla Cars |
188 |
|
|
6.4 Variable Selection in Linear Regression |
193 |
|
|
Reducing the Number of Predictors |
193 |
|
|
How to Reduce the Number of Predictors |
194 |
|
|
Problems |
201 |
|
|
CHAPTER 7 k-Nearest Neighbors (k-NN) |
205 |
|
|
7.1 The k-NN Classifier (Categorical Outcome) |
205 |
|
|
Determining Neighbors |
205 |
|
|
Classification Rule |
206 |
|
|
Example: Riding Mowers |
207 |
|
|
Choosing k |
208 |
|
|
Setting the Cutoff Value |
211 |
|
|
k-NN with More Than Two Classes |
212 |
|
|
Converting Categorical Variables to Binary Dummies |
212 |
|
|
7.2 k-NN for a Numerical Outcome |
212 |
|
|
7.3 Advantages and Shortcomings of k-NN Algorithms |
214 |
|
|
Problems |
216 |
|
|
CHAPTER 8 The Naive Bayes Classifier |
219 |
|
|
8.1 Introduction |
219 |
|
|
Cutoff Probability Method |
220 |
|
|
Conditional Probability |
220 |
|
|
Example 1: Predicting Fraudulent Financial Reporting |
220 |
|
|
8.2 Applying the Full (Exact) Bayesian Classifier |
221 |
|
|
Using the “Assign to the Most Probable Class” Method |
222 |
|
|
Using the Cutoff Probability Method |
222 |
|
|
Practical Difficulty with the Complete (Exact) Bayes Procedure |
222 |
|
|
Solution: Naive Bayes |
223 |
|
|
The Naive Bayes Assumption of Conditional Independence |
224 |
|
|
Using the Cutoff Probability Method |
224 |
|
|
Example 2: Predicting Fraudulent Financial Reports, Two Predictors |
225 |
|
|
Example 3: Predicting Delayed Flights |
226 |
|
|
8.3 Advantages and Shortcomings of the Naive Bayes Classifier |
231 |
|
|
Problems |
234 |
|
|
CHAPTER 9 Classification and Regression Trees |
237 |
|
|
9.1 Introduction |
237 |
|
|
9.2 Classification Trees |
239 |
|
|
Recursive Partitioning |
239 |
|
|
Example 1: Riding Mowers |
239 |
|
|
Measures of Impurity |
242 |
|
|
Tree Structure |
246 |
|
|
Classifying a New Record |
246 |
|
|
9.3 Evaluating the Performance of a Classification Tree |
247 |
|
|
Example 2: Acceptance of Personal Loan |
247 |
|
|
9.4 Avoiding Overfitting |
248 |
|
|
Stopping Tree Growth: Conditional Inference Trees |
253 |
|
|
Pruning the Tree |
254 |
|
|
Cross-Validation |
254 |
|
|
Best-Pruned Tree |
256 |
|
|
9.5 Classification Rules from Trees |
258 |
|
|
9.6 Classification Trees for More Than Two Classes |
259 |
|
|
9.7 Regression Trees |
259 |
|
|
Prediction |
260 |
|
|
Measuring Impurity |
260 |
|
|
Evaluating Performance |
261 |
|
|
9.8 Improving Prediction: Random Forests and Boosted Trees |
261 |
|
|
Random Forests |
261 |
|
|
Boosted Trees |
263 |
|
|
9.9 Advantages and Weaknesses of a Tree |
264 |
|
|
Problems |
266 |
|
|
CHAPTER 10 Logistic Regression |
269 |
|
|
10.1 Introduction |
269 |
|
|
10.2 The Logistic Regression Model |
271 |
|
|
10.3 Example: Acceptance of Personal Loan |
272 |
|
|
Model with a Single Predictor |
273 |
|
|
Estimating the Logistic Model from Data: Computing Parameter Estimates |
275 |
|
|
Interpreting Results in Terms of Odds (for a Profiling Goal) |
276 |
|
|
10.4 Evaluating Classification Performance |
279 |
|
|
Variable Selection |
280 |
|
|
10.5 Example of Complete Analysis: Predicting Delayed Flights |
282 |
|
|
Data Preprocessing |
283 |
|
|
Model-Fitting and Estimation |
286 |
|
|
Model Interpretation |
286 |
|
|
Model Performance |
286 |
|
|
Variable Selection |
289 |
|
|
10.6 Appendix: Logistic Regression for Profiling |
291 |
|
|
Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome |
291 |
|
|
Appendix B: Evaluating Explanatory Power |
293 |
|
|
Appendix C: Logistic Regression for More Than Two Classes |
296 |
|
|
Problems |
300 |
|
|
CHAPTER 11 Neural Nets |
303 |
|
|
11.1 Introduction |
303 |
|
|
11.2 Concept and Structure of a Neural Network |
304 |
|
|
11.3 Fitting a Network to Data |
305 |
|
|
Example 1: Tiny Dataset |
305 |
|
|
Computing Output of Nodes |
306 |
|
|
Preprocessing the Data |
309 |
|
|
Training the Model |
310 |
|
|
Example 2: Classifying Accident Severity |
314 |
|
|
Avoiding Overfitting |
315 |
|
|
Using the Output for Prediction and Classification |
315 |
|
|
11.4 Required User Input |
317 |
|
|
11.5 Exploring the Relationship Between Predictors and Outcome |
319 |
|
|
11.6 Advantages and Weaknesses of Neural Networks |
320 |
|
|
Problems |
322 |
|
|
CHAPTER 12 Discriminant Analysis |
325 |
|
|
12.1 Introduction |
325 |
|
|
Example 1: Riding Mowers |
326 |
|
|
Example 2: Personal Loan Acceptance |
326 |
|
|
12.2 Distance of a Record from a Class |
328 |
|
|
12.3 Fisher’s Linear Classification Functions |
329 |
|
|
12.4 Classification Performance of Discriminant Analysis |
332 |
|
|
12.5 Prior Probabilities |
334 |
|
|
12.6 Unequal Misclassification Costs |
334 |
|
|
12.7 Classifying More Than Two Classes |
335 |
|
|
Example 3: Medical Dispatch to Accident Scenes |
335 |
|
|
12.8 Advantages and Weaknesses |
338 |
|
|
Problems |
339 |
|
|
CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling |
343 |
|
|
13.1 Ensembles |
343 |
|
|
Why Ensembles Can Improve Predictive Power |
344 |
|
|
Simple Averaging |
346 |
|
|
Bagging |
347 |
|
|
Boosting |
347 |
|
|
Bagging and Boosting in R |
347 |
|
|
Advantages and Weaknesses of Ensembles |
347 |
|
|
13.2 Uplift (Persuasion) Modeling |
|
|
|
A-B Testing |
350 |
|
|
Uplift |
350 |
|
|
Gathering the Data |
351 |
|
|
A Simple Model |
352 |
|
|
Modeling Individual Uplift |
353 |
|
|
Computing Uplift with R |
354 |
|
|
Using the Results of an Uplift Model |
354 |
|
|
13.3 Summary |
356 |
|
|
Problems |
357 |
|
|
PART V MINING RELATIONSHIPS AMONG RECORDS |
359 |
|
|
CHAPTER 14 Association Rules and Collaborative Filtering |
361 |
|
|
14.1 Association Rules |
361 |
|
|
Discovering Association Rules in Transaction Databases |
362 |
|
|
Example 1: Synthetic Data on Purchases of Phone Faceplates |
362 |
|
|
Generating Candidate Rules |
362 |
|
|
The Apriori Algorithm |
365 |
|
|
Selecting Strong Rules |
365 |
|
|
Data Format |
367 |
|
|
The Process of Rule Selection |
368 |
|
|
Interpreting the Results |
369 |
|
|
Rules and Chance |
371 |
|
|
Example 2: Rules for Similar Book Purchases |
372 |
|
|
14.2 Collaborative Filtering |
374 |
|
|
Data Type and Format |
375 |
|
|
Example 3: Netflix Prize Contest |
375 |
|
|
User-Based Collaborative Filtering: “People Like You” |
376 |
|
|
Item-Based Collaborative Filtering |
379 |
|
|
Advantages and Weaknesses of Collaborative Filtering |
380 |
|
|
Collaborative Filtering vs. Association Rules |
381 |
|
|
14.3 Summary |
383 |
|
|
Problems |
384 |
|
|
CHAPTER 15 Cluster Analysis |
389 |
|
|
15.1 Introduction |
389 |
|
|
Example: Public Utilities |
391 |
|
|
15.2 Measuring Distance Between Two Records |
393 |
|
|
Euclidean Distance |
393 |
|
|
Normalizing Numerical Measurements |
394 |
|
|
Other Distance Measures for Numerical Data |
394 |
|
|
Distance Measures for Categorical Data |
397 |
|
|
Distance Measures for Mixed Data |
398 |
|
|
15.3 Measuring Distance Between Two Clusters |
398 |
|
|
Minimum Distance |
398 |
|
|
Maximum Distance |
398 |
|
|
Average Distance |
399 |
|
|
Centroid Distance |
399 |
|
|
15.4 Hierarchical (Agglomerative) Clustering |
400 |
|
|
Single Linkage |
401 |
|
|
Complete Linkage |
402 |
|
|
Average Linkage |
402 |
|
|
Centroid Linkage |
402 |
|
|
Ward’s Method |
402 |
|
|
Dendrograms: Displaying Clustering Process and Results |
403 |
|
|
Validating Clusters |
405 |
|
|
Limitations of Hierarchical Clustering |
407 |
|
|
15.5 Non-Hierarchical Clustering: The k-Means Algorithm |
408 |
|
|
Choosing the Number of Clusters (k) |
409 |
|
|
Problems |
414 |
|
|
PART VI FORECASTING TIME SERIES |
417 |
|
|
CHAPTER 16 Handling Time Series |
419 |
|
|
16.1 Introduction |
419 |
|
|
16.2 Descriptive vs. Predictive Modeling |
421 |
|
|
16.3 Popular Forecasting Methods in Business |
421 |
|
|
Combining Methods |
421 |
|
|
16.4 Time Series Components |
422 |
|
|
Example: Ridership on Amtrak Trains |
422 |
|
|
16.5 Data-Partitioning and Performance Evaluation |
427 |
|
|
Benchmark Performance: Naive Forecasts |
427 |
|
|
Generating Future Forecasts |
428 |
|
|
Problems |
430 |
|
|
CHAPTER 17 Regression-Based Forecasting |
433 |
|
|
17.1 A Model with Trend |
433 |
|
|
Linear Trend |
433 |
|
|
Exponential Trend |
437 |
|
|
Polynomial Trend |
439 |
|
|
17.2 A Model with Seasonality |
439 |
|
|
17.3 A Model with Trend and Seasonality |
443 |
|
|
17.4 Autocorrelation and ARIMA Models |
444 |
|
|
Computing Autocorrelation |
445 |
|
|
Improving Forecasts by Integrating Autocorrelation Information |
448 |
|
|
Evaluating Predictability |
452 |
|
|
Problems |
454 |
|
|
CHAPTER 18 Smoothing Methods |
465 |
|
|
18.1 Introduction |
465 |
|
|
18.2 Moving Average |
466 |
|
|
Centered Moving Average for Visualization |
466 |
|
|
Trailing Moving Average for Forecasting |
467 |
|
|
Choosing Window Width (w) |
471 |
|
|
18.3 Simple Exponential Smoothing |
471 |
|
|
Choosing Smoothing Parameter |
472 |
|
|
Relation Between Moving Average and Simple Exponential Smoothing |
472 |
|
|
18.4 Advanced Exponential Smoothing |
474 |
|
|
Series with a Trend |
474 |
|
|
Series with a Trend and Seasonality |
475 |
|
|
Series with Seasonality (No Trend) |
475 |
|
|
Problems |
478 |
|
|
PART VII DATA ANALYTICS |
485 |
|
|
CHAPTER 19 Social Network Analytics |
487 |
|
|
19.1 Introduction |
487 |
|
|
19.2 Directed vs. Undirected Networks |
489 |
|
|
19.3 Visualizing and Analyzing Networks |
490 |
|
|
Graph Layout |
490 |
|
|
Edge List |
492 |
|
|
Adjacency Matrix |
493 |
|
|
Using Network Data in Classification and Prediction |
493 |
|
|
19.4 Social Data Metrics and Taxonomy |
494 |
|
|
Node-Level Centrality Metrics |
495 |
|
|
Egocentric Network |
495 |
|
|
Network Metrics |
497 |
|
|
19.5 Using Network Metrics in Prediction and Classification |
499 |
|
|
Link Prediction |
499 |
|
|
Entity Resolution |
499 |
|
|
Collaborative Filtering |
500 |
|
|
19.6 Collecting Social Network Data with R |
503 |
|
|
19.7 Advantages and Disadvantages |
506 |
|
|
Problems |
508 |
|
|
CHAPTER 20 Text Mining |
511 |
|
|
20.1 Introduction |
511 |
|
|
20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words” |
512 |
|
|
20.3 Bag-of-Words vs. Meaning Extraction at Document Level |
513 |
|
|
20.4 Preprocessing the Text |
514 |
|
|
Tokenization |
516 |
|
|
Text Reduction |
517 |
|
|
Presence/Absence vs. Frequency |
519 |
|
|
Term Frequency–Inverse Document Frequency (TF-IDF) |
519 |
|
|
From Terms to Concepts: Latent Semantic Indexing |
520 |
|
|
Extracting Meaning |
521 |
|
|
20.5 Implementing Data Mining Methods |
521 |
|
|
20.6 Example: Online Discussions on Autos and Electronics |
522 |
|
|
Importing and Labeling the Records |
522 |
|
|
Text Preprocessing in R |
523 |
|
|
Producing a Concept Matrix |
523 |
|
|
Fitting a Predictive Model |
524 |
|
|
Prediction |
524 |
|
|
20.7 Summary |
526 |
|
|
Problems |
527 |
|
|
PART VIII CASES |
529 |
|
|
CHAPTER 21 Cases |
531 |
|
|
21.1 Charles Book Club |
531 |
|
|
The Book Industry |
531 |
|
|
Database Marketing at Charles |
532 |
|
|
Data Mining Techniques |
534 |
|
|
Assignment |
536 |
|
|
21.2 German Credit |
537 |
|
|
Background |
537 |
|
|
Data |
538 |
|
|
Assignment |
539 |
|
|
21.3 Tayko Software Cataloger |
542 |
|
|
Background |
542 |
|
|
The Mailing Experiment |
542 |
|
|
Data |
542 |
|
|
Assignment |
544 |
|
|
21.4 Political Persuasion |
545 |
|
|
Background |
545 |
|
|
Predictive Analytics Arrives in US Politics |
545 |
|
|
Political Targeting |
546 |
|
|
Uplift |
546 |
|
|
Data |
547 |
|
|
Assignment |
548 |
|
|
21.5 Taxi Cancellations |
549 |
|
|
Business Situation |
549 |
|
|
Assignment |
549 |
|
|
21.6 Segmenting Consumers of Bath Soap |
550 |
|
|
Business Situation |
550 |
|
|
Key Problems |
551 |
|
|
Data |
551 |
|
|
Measuring Brand Loyalty |
551 |
|
|
Assignment |
553 |
|
|
21.7 Direct-Mail Fundraising |
553 |
|
|
Background |
553 |
|
|
Data |
554 |
|
|
Assignment |
555 |
|
|
21.8 Catalog Cross-Selling |
556 |
|
|
Background |
556 |
|
|
Assignment |
556 |
|
|
21.9 Predicting Bankruptcy |
557 |
|
|
Predicting Corporate Bankruptcy |
557 |
|
|
Assignment |
558 |
|
|
21.10 Time Series Case: Forecasting Public Transportation Demand |
560 |
|
|
Background |
560 |
|
|
Problem Description |
560 |
|
|
Available Data |
560 |
|
|
Assignment Goal |
560 |
|
|
Assignment |
561 |
|
|
Tips and Suggested Steps |
561 |
|
|
References |
563 |
|
|
Data Files Used in the Book |
565 |
|
|
Index |
567 |
|
|
EULA |
577 |
|