Data Mining for Business Analytics - Concepts, Techniques, and Applications in R
by: Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl
Wiley, 2017
ISBN: 9781118956632
576 pages, download: 18429 KB

Format: PDF
Suitable for: Apple iPad, Android tablet PCs, online reading, PC, Mac, laptop

Type: A (simple access)

 

 
Table of Contents

  Cover 1  
  Title Page 5  
  Copyright 6  
  Contents 9  
  Foreword by Gareth James 21  
  Foreword by Ravi Bapna 23  
  Preface to the R Edition 25  
  Acknowledgments 29  
  PART I PRELIMINARIES 33  
     CHAPTER 1 Introduction 35  
        1.1 What Is Business Analytics? 35  
        1.2 What Is Data Mining? 37  
        1.3 Data Mining and Related Terms 37  
        1.4 Big Data 38  
        1.5 Data Science 39  
        1.6 Why Are There So Many Different Methods? 40  
        1.7 Terminology and Notation 41  
        1.8 Road Maps to This Book 43  
           Order of Topics 43  
     CHAPTER 2 Overview of the Data Mining Process 47  
        2.1 Introduction 47  
        2.2 Core Ideas in Data Mining 48  
           Classification 48  
           Prediction 48  
           Association Rules and Recommendation Systems 48  
           Predictive Analytics 49  
           Data Reduction and Dimension Reduction 49  
           Data Exploration and Visualization 49  
           Supervised and Unsupervised Learning 50  
        2.3 The Steps in Data Mining 51  
        2.4 Preliminary Steps 53  
           Organization of Datasets 53  
           Predicting Home Values in the West Roxbury Neighborhood 53  
           Loading and Looking at the Data in R  
           Sampling from a Database 56  
           Oversampling Rare Events in Classification Tasks 57  
           Preprocessing and Cleaning the Data  
        2.5 Predictive Power and Overfitting 65  
           Overfitting 65  
           Creation and Use of Data Partitions 67  
        2.6 Building a Predictive Model 70  
           Modeling Process 71  
        2.7 Using R for Data Mining on a Local Machine 75  
        2.8 Automating Data Mining Solutions 75  
           Data Mining Software: The State of the Market (by Herb Edelstein) 77  
        Problems 81  
  PART II DATA EXPLORATION AND DIMENSION REDUCTION 85  
     CHAPTER 3 Data Visualization 87  
        3.1 Uses of Data Visualization 87  
           Base R or ggplot? 89  
        3.2 Data Examples 89  
           Example 1: Boston Housing Data 89  
           Example 2: Ridership on Amtrak Trains 91  
        3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots 91  
           Distribution Plots: Boxplots and Histograms 93  
           Heatmaps: Visualizing Correlations and Missing Values 96  
        3.4 Multidimensional Visualization 99  
           Adding Variables: Color, Size, Shape, Multiple Panels, and Animation 99  
           Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering 102  
           Reference: Trend Lines and Labels 106  
           Scaling up to Large Datasets 106  
           Multivariate Plot: Parallel Coordinates Plot 107  
           Interactive Visualization 109  
        3.5 Specialized Visualizations 112  
           Visualizing Networked Data 112  
           Visualizing Hierarchical Data: Treemaps 114  
           Visualizing Geographical Data: Map Charts 115  
        3.6 Summary: Major Visualizations and Operations, by Data Mining Goal 118  
           Prediction 118  
           Classification 118  
           Time Series Forecasting 118  
           Unsupervised Learning 119  
        Problems 120  
     CHAPTER 4 Dimension Reduction 123  
        4.1 Introduction 123  
        4.2 Curse of Dimensionality 124  
        4.3 Practical Considerations 124  
           Example 1: House Prices in Boston 125  
        4.4 Data Summaries 126  
           Summary Statistics 126  
           Aggregation and Pivot Tables 128  
        4.5 Correlation Analysis 129  
        4.6 Reducing the Number of Categories in Categorical Variables 131  
        4.7 Converting a Categorical Variable to a Numerical Variable 131  
        4.8 Principal Components Analysis 133  
           Example 2: Breakfast Cereals 133  
           Principal Components 138  
           Normalizing the Data 139  
           Using Principal Components for Classification and Prediction 141  
        4.9 Dimension Reduction Using Regression Models 143  
        4.10 Dimension Reduction Using Classification and Regression Trees 143  
        Problems 144  
  PART III PERFORMANCE EVALUATION 147  
     CHAPTER 5 Evaluating Predictive Performance 149  
        5.1 Introduction 149  
        5.2 Evaluating Predictive Performance 150  
           Naive Benchmark: The Average 150  
           Prediction Accuracy Measures 151  
           Comparing Training and Validation Performance 153  
           Lift Chart 153  
        5.3 Judging Classifier Performance 154  
           Benchmark: The Naive Rule 156  
           Class Separation 156  
           The Confusion (Classification) Matrix 156  
           Using the Validation Data 158  
           Accuracy Measures 158  
           Propensities and Cutoff for Classification 159  
           Performance in Case of Unequal Importance of Classes 163  
           Asymmetric Misclassification Costs 165  
           Generalization to More Than Two Classes 167  
        5.4 Judging Ranking Performance 168  
           Lift Charts for Binary Data 168  
           Decile Lift Charts 170  
           Beyond Two Classes 171  
           Lift Charts Incorporating Costs and Benefits 171  
           Lift as a Function of Cutoff 172  
        5.5 Oversampling 172  
           Oversampling the Training Set 176  
           Evaluating Model Performance Using a Non-oversampled Validation Set 176  
           Evaluating Model Performance if Only Oversampled Validation Set Exists 176  
        Problems 179  
  PART IV PREDICTION AND CLASSIFICATION METHODS 183  
     CHAPTER 6 Multiple Linear Regression 185  
        6.1 Introduction 185  
        6.2 Explanatory vs. Predictive Modeling 186  
        6.3 Estimating the Regression Equation and Prediction 188  
           Example: Predicting the Price of Used Toyota Corolla Cars 188  
        6.4 Variable Selection in Linear Regression 193  
           Reducing the Number of Predictors 193  
           How to Reduce the Number of Predictors 194  
        Problems 201  
     CHAPTER 7 k-Nearest Neighbors (kNN) 205  
        7.1 The k-NN Classifier (Categorical Outcome) 205  
           Determining Neighbors 205  
           Classification Rule 206  
           Example: Riding Mowers 207  
           Choosing k 208  
           Setting the Cutoff Value 211  
           k-NN with More Than Two Classes 212  
           Converting Categorical Variables to Binary Dummies 212  
        7.2 k-NN for a Numerical Outcome 212  
        7.3 Advantages and Shortcomings of k-NN Algorithms 214  
        Problems 216  
     CHAPTER 8 The Naive Bayes Classifier 219  
        8.1 Introduction 219  
           Cutoff Probability Method 220  
           Conditional Probability 220  
           Example 1: Predicting Fraudulent Financial Reporting 220  
        8.2 Applying the Full (Exact) Bayesian Classifier 221  
           Using the “Assign to the Most Probable Class” Method 222  
           Using the Cutoff Probability Method 222  
           Practical Difficulty with the Complete (Exact) Bayes Procedure 222  
           Solution: Naive Bayes 223  
           The Naive Bayes Assumption of Conditional Independence 224  
           Using the Cutoff Probability Method 224  
           Example 2: Predicting Fraudulent Financial Reports, Two Predictors 225  
           Example 3: Predicting Delayed Flights 226  
        8.3 Advantages and Shortcomings of the Naive Bayes Classifier 231  
        Problems 234  
     CHAPTER 9 Classification and Regression Trees 237  
        9.1 Introduction 237  
        9.2 Classification Trees 239  
           Recursive Partitioning 239  
           Example 1: Riding Mowers 239  
           Measures of Impurity 242  
           Tree Structure 246  
           Classifying a New Record 246  
        9.3 Evaluating the Performance of a Classification Tree 247  
           Example 2: Acceptance of Personal Loan 247  
        9.4 Avoiding Overfitting 248  
           Stopping Tree Growth: Conditional Inference Trees 253  
           Pruning the Tree 254  
           Cross-Validation 254  
           Best-Pruned Tree 256  
        9.5 Classification Rules from Trees 258  
        9.6 Classification Trees for More Than Two Classes 259  
        9.7 Regression Trees 259  
           Prediction 260  
           Measuring Impurity 260  
           Evaluating Performance 261  
        9.8 Improving Prediction: Random Forests and Boosted Trees 261  
           Random Forests 261  
           Boosted Trees 263  
        9.9 Advantages and Weaknesses of a Tree 264  
        Problems 266  
     CHAPTER 10 Logistic Regression 269  
        10.1 Introduction 269  
        10.2 The Logistic Regression Model 271  
        10.3 Example: Acceptance of Personal Loan 272  
           Model with a Single Predictor 273  
           Estimating the Logistic Model from Data: Computing Parameter Estimates 275  
           Interpreting Results in Terms of Odds (for a Profiling Goal) 276  
        10.4 Evaluating Classification Performance 279  
           Variable Selection 280  
        10.5 Example of Complete Analysis: Predicting Delayed Flights 282  
           Data Preprocessing 283  
           Model-Fitting and Estimation 286  
           Model Interpretation 286  
           Model Performance 286  
           Variable Selection 289  
        10.6 Appendix: Logistic Regression for Profiling 291  
           Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome 291  
           Appendix B: Evaluating Explanatory Power 293  
           Appendix C: Logistic Regression for More Than Two Classes 296  
        Problems 300  
     CHAPTER 11 Neural Nets 303  
        11.1 Introduction 303  
        11.2 Concept and Structure of a Neural Network 304  
        11.3 Fitting a Network to Data 305  
           Example 1: Tiny Dataset 305  
           Computing Output of Nodes 306  
           Preprocessing the Data 309  
           Training the Model 310  
           Example 2: Classifying Accident Severity 314  
           Avoiding Overfitting 315  
           Using the Output for Prediction and Classification 315  
        11.4 Required User Input 317  
        11.5 Exploring the Relationship Between Predictors and Outcome 319  
        11.6 Advantages and Weaknesses of Neural Networks 320  
        Problems 322  
     CHAPTER 12 Discriminant Analysis 325  
        12.1 Introduction 325  
           Example 1: Riding Mowers 326  
           Example 2: Personal Loan Acceptance 326  
        12.2 Distance of a Record from a Class 328  
        12.3 Fisher’s Linear Classification Functions 329  
        12.4 Classification Performance of Discriminant Analysis 332  
        12.5 Prior Probabilities 334  
        12.6 Unequal Misclassification Costs 334  
        12.7 Classifying More Than Two Classes 335  
           Example 3: Medical Dispatch to Accident Scenes 335  
        12.8 Advantages and Weaknesses 338  
        Problems 339  
     CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling 343  
        13.1 Ensembles 343  
           Why Ensembles Can Improve Predictive Power 344  
           Simple Averaging 346  
           Bagging 347  
           Boosting 347  
           Bagging and Boosting in R 347  
           Advantages and Weaknesses of Ensembles 347  
        13.2 Uplift (Persuasion) Modeling  
           A-B Testing 350  
           Uplift 350  
           Gathering the Data 351  
           A Simple Model 352  
           Modeling Individual Uplift 353  
           Computing Uplift with R 354  
           Using the Results of an Uplift Model 354  
        13.3 Summary 356  
        Problems 357  
  PART V MINING RELATIONSHIPS AMONG RECORDS 359  
     CHAPTER 14 Association Rules and Collaborative Filtering 361  
        14.1 Association Rules 361  
           Discovering Association Rules in Transaction Databases 362  
           Example 1: Synthetic Data on Purchases of Phone Faceplates 362  
           Generating Candidate Rules 362  
           The Apriori Algorithm 365  
           Selecting Strong Rules 365  
           Data Format 367  
           The Process of Rule Selection 368  
           Interpreting the Results 369  
           Rules and Chance 371  
           Example 2: Rules for Similar Book Purchases 372  
        14.2 Collaborative Filtering 374  
           Data Type and Format 375  
           Example 3: Netflix Prize Contest 375  
           User-Based Collaborative Filtering: “People Like You” 376  
           Item-Based Collaborative Filtering 379  
           Advantages and Weaknesses of Collaborative Filtering 380  
           Collaborative Filtering vs. Association Rules 381  
        14.3 Summary 383  
        Problems 384  
     CHAPTER 15 Cluster Analysis 389  
        15.1 Introduction 389  
           Example: Public Utilities 391  
        15.2 Measuring Distance Between Two Records 393  
           Euclidean Distance 393  
           Normalizing Numerical Measurements 394  
           Other Distance Measures for Numerical Data 394  
           Distance Measures for Categorical Data 397  
           Distance Measures for Mixed Data 398  
        15.3 Measuring Distance Between Two Clusters 398  
           Minimum Distance 398  
           Maximum Distance 398  
           Average Distance 399  
           Centroid Distance 399  
        15.4 Hierarchical (Agglomerative) Clustering 400  
           Single Linkage 401  
           Complete Linkage 402  
           Average Linkage 402  
           Centroid Linkage 402  
           Ward’s Method 402  
           Dendrograms: Displaying Clustering Process and Results 403  
           Validating Clusters 405  
           Limitations of Hierarchical Clustering 407  
        15.5 Non-Hierarchical Clustering: The k-Means Algorithm 408  
           Choosing the Number of Clusters (k) 409  
        Problems 414  
  PART VI FORECASTING TIME SERIES 417  
     CHAPTER 16 Handling Time Series 419  
        16.1 Introduction 419  
        16.2 Descriptive vs. Predictive Modeling 421  
        16.3 Popular Forecasting Methods in Business 421  
           Combining Methods 421  
        16.4 Time Series Components 422  
           Example: Ridership on Amtrak Trains 422  
        16.5 Data-Partitioning and Performance Evaluation 427  
           Benchmark Performance: Naive Forecasts 427  
           Generating Future Forecasts 428  
        Problems 430  
     CHAPTER 17 Regression-Based Forecasting 433  
        17.1 A Model with Trend 433  
           Linear Trend 433  
           Exponential Trend 437  
           Polynomial Trend 439  
        17.2 A Model with Seasonality 439  
        17.3 A Model with Trend and Seasonality 443  
        17.4 Autocorrelation and ARIMA Models 444  
           Computing Autocorrelation 445  
           Improving Forecasts by Integrating Autocorrelation Information 448  
           Evaluating Predictability 452  
        Problems 454  
     CHAPTER 18 Smoothing Methods 465  
        18.1 Introduction 465  
        18.2 Moving Average 466  
           Centered Moving Average for Visualization 466  
           Trailing Moving Average for Forecasting 467  
           Choosing Window Width (w) 471  
        18.3 Simple Exponential Smoothing 471  
           Choosing Smoothing Parameter 472  
           Relation Between Moving Average and Simple Exponential Smoothing 472  
        18.4 Advanced Exponential Smoothing 474  
           Series with a Trend 474  
           Series with a Trend and Seasonality 475  
           Series with Seasonality (No Trend) 475  
        Problems 478  
  PART VII DATA ANALYTICS 485  
     CHAPTER 19 Social Network Analytics 487  
        19.1 Introduction 487  
        19.2 Directed vs. Undirected Networks 489  
        19.3 Visualizing and Analyzing Networks 490  
           Graph Layout 490  
           Edge List 492  
           Adjacency Matrix 493  
           Using Network Data in Classification and Prediction 493  
        19.4 Social Data Metrics and Taxonomy 494  
           Node-Level Centrality Metrics 495  
           Egocentric Network 495  
           Network Metrics 497  
        19.5 Using Network Metrics in Prediction and Classification 499  
           Link Prediction 499  
           Entity Resolution 499  
           Collaborative Filtering 500  
        19.6 Collecting Social Network Data with R 503  
        19.7 Advantages and Disadvantages 506  
        Problems 508  
     CHAPTER 20 Text Mining 511  
        20.1 Introduction 511  
        20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words” 512  
        20.3 Bag-of-Words vs. Meaning Extraction at Document Level 513  
        20.4 Preprocessing the Text 514  
           Tokenization 516  
           Text Reduction 517  
           Presence/Absence vs. Frequency 519  
           Term Frequency–Inverse Document Frequency (TF-IDF) 519  
           From Terms to Concepts: Latent Semantic Indexing 520  
           Extracting Meaning 521  
        20.5 Implementing Data Mining Methods 521  
        20.6 Example: Online Discussions on Autos and Electronics 522  
           Importing and Labeling the Records 522  
           Text Preprocessing in R 523  
           Producing a Concept Matrix 523  
           Fitting a Predictive Model 524  
           Prediction 524  
        20.7 Summary 526  
        Problems 527  
  PART VIII CASES 529  
     CHAPTER 21 Cases 531  
        21.1 Charles Book Club 531  
           The Book Industry 531  
           Database Marketing at Charles 532  
           Data Mining Techniques 534  
           Assignment 536  
        21.2 German Credit 537  
           Background 537  
           Data 538  
           Assignment 539  
        21.3 Tayko Software Cataloger 542  
           Background 542  
           The Mailing Experiment 542  
           Data 542  
           Assignment 544  
        21.4 Political Persuasion 545  
           Background 545  
           Predictive Analytics Arrives in US Politics 545  
           Political Targeting 546  
           Uplift 546  
           Data 547  
           Assignment 548  
        21.5 Taxi Cancellations 549  
           Business Situation 549  
           Assignment 549  
        21.6 Segmenting Consumers of Bath Soap 550  
           Business Situation 550  
           Key Problems 551  
           Data 551  
           Measuring Brand Loyalty 551  
           Assignment 553  
        21.7 Direct-Mail Fundraising 553  
           Background 553  
           Data 554  
           Assignment 555  
        21.8 Catalog Cross-Selling 556  
           Background 556  
           Assignment 556  
        21.9 Predicting Bankruptcy 557  
           Predicting Corporate Bankruptcy 557  
           Assignment 558  
        21.10 Time Series Case: Forecasting Public Transportation Demand 560  
           Background 560  
           Problem Description 560  
           Available Data 560  
           Assignment Goal 560  
           Assignment 561  
           Tips and Suggested Steps 561  
  References 563  
  Data Files Used in the Book 565  
  Index 567  
  EULA 577  

