	Cover	1
	Title Page	5
	Copyright	6
	Contents	9
	Foreword by Gareth James	21
	Foreword by Ravi Bapna	23
	Preface to the R Edition	25
	Acknowledgments	29
	PART I PRELIMINARIES	33
	CHAPTER 1 Introduction	35
	1.1 What Is Business Analytics?	35
	1.2 What Is Data Mining?	37
	1.3 Data Mining and Related Terms	37
	1.4 Big Data	38
	1.5 Data Science	39
	1.6 Why Are There So Many Different Methods?	40
	1.7 Terminology and Notation	41
	1.8 Road Maps to This Book	43
	Order of Topics	43
	CHAPTER 2 Overview of the Data Mining Process	47
	2.1 Introduction	47
	2.2 Core Ideas in Data Mining	48
	Classification	48
	Prediction	48
	Association Rules and Recommendation Systems	48
	Predictive Analytics	49
	Data Reduction and Dimension Reduction	49
	Data Exploration and Visualization	49
	Supervised and Unsupervised Learning	50
	2.3 The Steps in Data Mining	51
	2.4 Preliminary Steps	53
	Organization of Datasets	53
	Predicting Home Values in the West Roxbury Neighborhood	53
	Loading and Looking at the Data in R
	Sampling from a Database	56
	Oversampling Rare Events in Classification Tasks	57
	Preprocessing and Cleaning the Data
	2.5 Predictive Power and Overfitting	65
	Overfitting	65
	Creation and Use of Data Partitions	67
	2.6 Building a Predictive Model	70
	Modeling Process	71
	2.7 Using R for Data Mining on a Local Machine	75
	2.8 Automating Data Mining Solutions	75
	Data Mining Software: The State of the Market (by Herb Edelstein)	77
	Problems	81
	PART II DATA EXPLORATION AND DIMENSION REDUCTION	85
	CHAPTER 3 Data Visualization	87
	3.1 Uses of Data Visualization	87
	Base R or ggplot?	89
	3.2 Data Examples	89
	Example 1: Boston Housing Data	89
	Example 2: Ridership on Amtrak Trains	91
	3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots	91
	Distribution Plots: Boxplots and Histograms	93
	Heatmaps: Visualizing Correlations and Missing Values	96
	3.4 Multidimensional Visualization	99
	Adding Variables: Color, Size, Shape, Multiple Panels, and Animation	99
	Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering	102
	Reference: Trend Lines and Labels	106
	Scaling up to Large Datasets	106
	Multivariate Plot: Parallel Coordinates Plot	107
	Interactive Visualization	109
	3.5 Specialized Visualizations	112
	Visualizing Networked Data	112
	Visualizing Hierarchical Data: Treemaps	114
	Visualizing Geographical Data: Map Charts	115
	3.6 Summary: Major Visualizations and Operations, by Data Mining Goal	118
	Prediction	118
	Classification	118
	Time Series Forecasting	118
	Unsupervised Learning	119
	Problems	120
	CHAPTER 4 Dimension Reduction	123
	4.1 Introduction	123
	4.2 Curse of Dimensionality	124
	4.3 Practical Considerations	124
	Example 1: House Prices in Boston	125
	4.4 Data Summaries	126
	Summary Statistics	126
	Aggregation and Pivot Tables	128
	4.5 Correlation Analysis	129
	4.6 Reducing the Number of Categories in Categorical Variables	131
	4.7 Converting a Categorical Variable to a Numerical Variable	131
	4.8 Principal Components Analysis	133
	Example 2: Breakfast Cereals	133
	Principal Components	138
	Normalizing the Data	139
	Using Principal Components for Classification and Prediction	141
	4.9 Dimension Reduction Using Regression Models	143
	4.10 Dimension Reduction Using Classification and Regression Trees	143
	Problems	144
	PART III PERFORMANCE EVALUATION	147
	CHAPTER 5 Evaluating Predictive Performance	149
	5.1 Introduction	149
	5.2 Evaluating Predictive Performance	150
	Naive Benchmark: The Average	150
	Prediction Accuracy Measures	151
	Comparing Training and Validation Performance	153
	Lift Chart	153
	5.3 Judging Classifier Performance	154
	Benchmark: The Naive Rule	156
	Class Separation	156
	The Confusion (Classification) Matrix	156
	Using the Validation Data	158
	Accuracy Measures	158
	Propensities and Cutoff for Classification	159
	Performance in Case of Unequal Importance of Classes	163
	Asymmetric Misclassification Costs	165
	Generalization to More Than Two Classes	167
	5.4 Judging Ranking Performance	168
	Lift Charts for Binary Data	168
	Decile Lift Charts	170
	Beyond Two Classes	171
	Lift Charts Incorporating Costs and Benefits	171
	Lift as a Function of Cutoff	172
	5.5 Oversampling	172
	Oversampling the Training Set	176
	Evaluating Model Performance Using a Non-oversampled Validation Set	176
	Evaluating Model Performance if Only Oversampled Validation Set Exists	176
	Problems	179
	PART IV PREDICTION AND CLASSIFICATION METHODS	183
	CHAPTER 6 Multiple Linear Regression	185
	6.1 Introduction	185
	6.2 Explanatory vs. Predictive Modeling	186
	6.3 Estimating the Regression Equation and Prediction	188
	Example: Predicting the Price of Used Toyota Corolla Cars	188
	6.4 Variable Selection in Linear Regression	193
	Reducing the Number of Predictors	193
	How to Reduce the Number of Predictors	194
	Problems	201
	CHAPTER 7 k-Nearest Neighbors (kNN)	205
	7.1 The k-NN Classifier (Categorical Outcome)	205
	Determining Neighbors	205
	Classification Rule	206
	Example: Riding Mowers	207
	Choosing k	208
	Setting the Cutoff Value	211
	k-NN with More Than Two Classes	212
	Converting Categorical Variables to Binary Dummies	212
	7.2 k-NN for a Numerical Outcome	212
	7.3 Advantages and Shortcomings of k-NN Algorithms	214
	Problems	216
	CHAPTER 8 The Naive Bayes Classifier	219
	8.1 Introduction	219
	Cutoff Probability Method	220
	Conditional Probability	220
	Example 1: Predicting Fraudulent Financial Reporting	220
	8.2 Applying the Full (Exact) Bayesian Classifier	221
	Using the “Assign to the Most Probable Class” Method	222
	Using the Cutoff Probability Method	222
	Practical Difficulty with the Complete (Exact) Bayes Procedure	222
	Solution: Naive Bayes	223
	The Naive Bayes Assumption of Conditional Independence	224
	Using the Cutoff Probability Method	224
	Example 2: Predicting Fraudulent Financial Reports, Two Predictors	225
	Example 3: Predicting Delayed Flights	226
	8.3 Advantages and Shortcomings of the Naive Bayes Classifier	231
	Problems	234
	CHAPTER 9 Classification and Regression Trees	237
	9.1 Introduction	237
	9.2 Classification Trees	239
	Recursive Partitioning	239
	Example 1: Riding Mowers	239
	Measures of Impurity	242
	Tree Structure	246
	Classifying a New Record	246
	9.3 Evaluating the Performance of a Classification Tree	247
	Example 2: Acceptance of Personal Loan	247
	9.4 Avoiding Overfitting	248
	Stopping Tree Growth: Conditional Inference Trees	253
	Pruning the Tree	254
	Cross-Validation	254
	Best-Pruned Tree	256
	9.5 Classification Rules from Trees	258
	9.6 Classification Trees for More Than Two Classes	259
	9.7 Regression Trees	259
	Prediction	260
	Measuring Impurity	260
	Evaluating Performance	261
	9.8 Improving Prediction: Random Forests and Boosted Trees	261
	Random Forests	261
	Boosted Trees	263
	9.9 Advantages and Weaknesses of a Tree	264
	Problems	266
	CHAPTER 10 Logistic Regression	269
	10.1 Introduction	269
	10.2 The Logistic Regression Model	271
	10.3 Example: Acceptance of Personal Loan	272
	Model with a Single Predictor	273
	Estimating the Logistic Model from Data: Computing Parameter Estimates	275
	Interpreting Results in Terms of Odds (for a Profiling Goal)	276
	10.4 Evaluating Classification Performance	279
	Variable Selection	280
	10.5 Example of Complete Analysis: Predicting Delayed Flights	282
	Data Preprocessing	283
	Model-Fitting and Estimation	286
	Model Interpretation	286
	Model Performance	286
	Variable Selection	289
	10.6 Appendix: Logistic Regression for Profiling	291
	Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome	291
	Appendix B: Evaluating Explanatory Power	293
	Appendix C: Logistic Regression for More Than Two Classes	296
	Problems	300
	CHAPTER 11 Neural Nets	303
	11.1 Introduction	303
	11.2 Concept and Structure of a Neural Network	304
	11.3 Fitting a Network to Data	305
	Example 1: Tiny Dataset	305
	Computing Output of Nodes	306
	Preprocessing the Data	309
	Training the Model	310
	Example 2: Classifying Accident Severity	314
	Avoiding Overfitting	315
	Using the Output for Prediction and Classification	315
	11.4 Required User Input	317
	11.5 Exploring the Relationship Between Predictors and Outcome	319
	11.6 Advantages and Weaknesses of Neural Networks	320
	Problems	322
	CHAPTER 12 Discriminant Analysis	325
	12.1 Introduction	325
	Example 1: Riding Mowers	326
	Example 2: Personal Loan Acceptance	326
	12.2 Distance of a Record from a Class	328
	12.3 Fisher’s Linear Classification Functions	329
	12.4 Classification Performance of Discriminant Analysis	332
	12.5 Prior Probabilities	334
	12.6 Unequal Misclassification Costs	334
	12.7 Classifying More Than Two Classes	335
	Example 3: Medical Dispatch to Accident Scenes	335
	12.8 Advantages and Weaknesses	338
	Problems	339
	CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling	343
	13.1 Ensembles	343
	Why Ensembles Can Improve Predictive Power	344
	Simple Averaging	346
	Bagging	347
	Boosting	347
	Bagging and Boosting in R	347
	Advantages and Weaknesses of Ensembles	347
	13.2 Uplift (Persuasion) Modeling
	A-B Testing	350
	Uplift	350
	Gathering the Data	351
	A Simple Model	352
	Modeling Individual Uplift	353
	Computing Uplift with R	354
	Using the Results of an Uplift Model	354
	13.3 Summary	356
	Problems	357
	PART V MINING RELATIONSHIPS AMONG RECORDS	359
	CHAPTER 14 Association Rules and Collaborative Filtering	361
	14.1 Association Rules	361
	Discovering Association Rules in Transaction Databases	362
	Example 1: Synthetic Data on Purchases of Phone Faceplates	362
	Generating Candidate Rules	362
	The Apriori Algorithm	365
	Selecting Strong Rules	365
	Data Format	367
	The Process of Rule Selection	368
	Interpreting the Results	369
	Rules and Chance	371
	Example 2: Rules for Similar Book Purchases	372
	14.2 Collaborative Filtering	374
	Data Type and Format	375
	Example 3: Netflix Prize Contest	375
	User-Based Collaborative Filtering: “People Like You”	376
	Item-Based Collaborative Filtering	379
	Advantages and Weaknesses of Collaborative Filtering	380
	Collaborative Filtering vs. Association Rules	381
	14.3 Summary	383
	Problems	384
	CHAPTER 15 Cluster Analysis	389
	15.1 Introduction	389
	Example: Public Utilities	391
	15.2 Measuring Distance Between Two Records	393
	Euclidean Distance	393
	Normalizing Numerical Measurements	394
	Other Distance Measures for Numerical Data	394
	Distance Measures for Categorical Data	397
	Distance Measures for Mixed Data	398
	15.3 Measuring Distance Between Two Clusters	398
	Minimum Distance	398
	Maximum Distance	398
	Average Distance	399
	Centroid Distance	399
	15.4 Hierarchical (Agglomerative) Clustering	400
	Single Linkage	401
	Complete Linkage	402
	Average Linkage	402
	Centroid Linkage	402
	Ward’s Method	402
	Dendrograms: Displaying Clustering Process and Results	403
	Validating Clusters	405
	Limitations of Hierarchical Clustering	407
	15.5 Non-Hierarchical Clustering: The k-Means Algorithm	408
	Choosing the Number of Clusters (k)	409
	Problems	414
	PART VI FORECASTING TIME SERIES	417
	CHAPTER 16 Handling Time Series	419
	16.1 Introduction	419
	16.2 Descriptive vs. Predictive Modeling	421
	16.3 Popular Forecasting Methods in Business	421
	Combining Methods	421
	16.4 Time Series Components	422
	Example: Ridership on Amtrak Trains	422
	16.5 Data-Partitioning and Performance Evaluation	427
	Benchmark Performance: Naive Forecasts	427
	Generating Future Forecasts	428
	Problems	430
	CHAPTER 17 Regression-Based Forecasting	433
	17.1 A Model with Trend	433
	Linear Trend	433
	Exponential Trend	437
	Polynomial Trend	439
	17.2 A Model with Seasonality	439
	17.3 A Model with Trend and Seasonality	443
	17.4 Autocorrelation and ARIMA Models	444
	Computing Autocorrelation	445
	Improving Forecasts by Integrating Autocorrelation Information	448
	Evaluating Predictability	452
	Problems	454
	CHAPTER 18 Smoothing Methods	465
	18.1 Introduction	465
	18.2 Moving Average	466
	Centered Moving Average for Visualization	466
	Trailing Moving Average for Forecasting	467
	Choosing Window Width (w)	471
	18.3 Simple Exponential Smoothing	471
	Choosing Smoothing Parameter	472
	Relation Between Moving Average and Simple Exponential Smoothing	472
	18.4 Advanced Exponential Smoothing	474
	Series with a Trend	474
	Series with a Trend and Seasonality	475
	Series with Seasonality (No Trend)	475
	Problems	478
	PART VII DATA ANALYTICS	485
	CHAPTER 19 Social Network Analytics	487
	19.1 Introduction	487
	19.2 Directed vs. Undirected Networks	489
	19.3 Visualizing and Analyzing Networks	490
	Graph Layout	490
	Edge List	492
	Adjacency Matrix	493
	Using Network Data in Classification and Prediction	493
	19.4 Social Data Metrics and Taxonomy	494
	Node-Level Centrality Metrics	495
	Egocentric Network	495
	Network Metrics	497
	19.5 Using Network Metrics in Prediction and Classification	499
	Link Prediction	499
	Entity Resolution	499
	Collaborative Filtering	500
	19.6 Collecting Social Network Data with R	503
	19.7 Advantages and Disadvantages	506
	Problems	508
	CHAPTER 20 Text Mining	511
	20.1 Introduction	511
	20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words”	512
	20.3 Bag-of-Words vs. Meaning Extraction at Document Level	513
	20.4 Preprocessing the Text	514
	Tokenization	516
	Text Reduction	517
	Presence/Absence vs. Frequency	519
	Term Frequency–Inverse Document Frequency (TF-IDF)	519
	From Terms to Concepts: Latent Semantic Indexing	520
	Extracting Meaning	521
	20.5 Implementing Data Mining Methods	521
	20.6 Example: Online Discussions on Autos and Electronics	522
	Importing and Labeling the Records	522
	Text Preprocessing in R	523
	Producing a Concept Matrix	523
	Fitting a Predictive Model	524
	Prediction	524
	20.7 Summary	526
	Problems	527
	PART VIII CASES	529
	CHAPTER 21 Cases	531
	21.1 Charles Book Club	531
	The Book Industry	531
	Database Marketing at Charles	532
	Data Mining Techniques	534
	Assignment	536
	21.2 German Credit	537
	Background	537
	Data	538
	Assignment	539
	21.3 Tayko Software Cataloger	542
	Background	542
	The Mailing Experiment	542
	Data	542
	Assignment	544
	21.4 Political Persuasion	545
	Background	545
	Predictive Analytics Arrives in US Politics	545
	Political Targeting	546
	Uplift	546
	Data	547
	Assignment	548
	21.5 Taxi Cancellations	549
	Business Situation	549
	Assignment	549
	21.6 Segmenting Consumers of Bath Soap	550
	Business Situation	550
	Key Problems	551
	Data	551
	Measuring Brand Loyalty	551
	Assignment	553
	21.7 Direct-Mail Fundraising	553
	Background	553
	Data	554
	Assignment	555
	21.8 Catalog Cross-Selling	556
	Background	556
	Assignment	556
	21.9 Predicting Bankruptcy	557
	Predicting Corporate Bankruptcy	557
	Assignment	558
	21.10 Time Series Case: Forecasting Public Transportation Demand	560
	Background	560
	Problem Description	560
	Available Data	560
	Assignment Goal	560
	Assignment	561
	Tips and Suggested Steps	561
	References	563
	Data Files Used in the Book	565
	Index	567
	EULA	577