## 1. Unsupervised Data Mining Categories

### 1.1. K-Means Clustering

1.1.1. Goal: Increase between-cluster variation, decrease within-cluster variation

1.1.2. Target Variable: Unknown

1.1.3. 5 Steps of Clustering:

1. Choose the number of clusters desired, k
2. Start with a partition into k clusters, often based on random selection of k centroids
3. At each step, move each record to the cluster with the closest centroid
4. Recompute centroids, then repeat step 3
5. Stop when moving records increases within-cluster dispersion

1.1.4. Unsupervised
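The five clustering steps in 1.1.3 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `k_means`, the `max_iter` and `seed` parameters, the toy data, and the use of Euclidean distance are all assumptions. It also simplifies the stopping rule to "no record changes cluster".

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n_records, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: start from k randomly selected records as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: move each record to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5 (simplified assumption): stop once no record changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its member records.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Step 1: choose k. The six records below are made-up toy data.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
```

Because the starting centroids are random, different seeds can produce different partitions; in practice the procedure is often run several times and the partition with the lowest within-cluster variation is kept.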

### 1.2. Association

1.2.1. Goal: Looking for associated predictors

1.2.2. Unsupervised

### 1.3. Visualization

## 2. Data Mining Process

### 2.1. CRISP-DM

2.1.1. Business Understanding

2.1.1.1.

- Define business requirements and objectives
- Translate objective into data mining problem definition
- Prepare initial strategy to meet objectives

2.1.2. Data Understanding

2.1.2.1.

- Collect data
- Assess data quality
- Perform exploratory data analysis

2.1.3. Data Preparation

2.1.3.1.

- Cleanse, prepare, and transform the data set
- Prepare the data for modeling in subsequent phases
- Select cases and variables appropriate for analysis

2.1.4. Modeling

2.1.4.1.

- Select and apply one or more modeling techniques
- Calibrate model settings to optimize results
- If necessary, additional data preparation may be required

2.1.5. Evaluation

2.1.5.1.

- Evaluate one or more models for effectiveness
- Determine whether defined objectives were achieved
- Make a decision regarding data mining results before deploying to the field

2.1.6. Deployment

2.1.6.1.

- Make use of models created
- Simple deployment: generate report
- Complex deployment: implement additional data mining effort in another department
- In business, customer often carries out deployment based on model

### 2.2. SEMMA

## 3. Data Mining Tasks

### 3.1. Estimation

3.1.1. Uses numerical or categorical predictor (IV) values to estimate changes in a numerical target variable (DV)

### 3.2. Clustering

3.2.1. Similar to classification, but with no target variable. Clustering tasks do not aim at estimating, predicting, or classifying a target variable

### 3.3. Classification

3.3.1. Like estimation, but target variables (DV’s) are categorical

### 3.4. Association

3.4.1. Finding attributes of data that go together. Profiling relationships between two or more attributes. Understanding consequent behaviors based on prior behaviors.

### 3.5. Prediction

3.5.1. Like classification and estimation, but the results lie in the future

### 3.6. Description

## 4. Supervised Data Mining Methodologies

### 4.1. Multiple Regression

4.1.1. Statistical Approach

4.1.1.1. Explanatory / Descriptive modeling

4.1.1.2. Main Goal: Find the best model to describe the existing data

4.1.1.3. Fit Statistics: Adjusted R-squared

4.1.2. Data Mining Approach

4.1.2.1. Predictive Modeling

4.1.2.2. Main Goal: Find the best model to predict new data

4.1.2.3. Fit Statistics: MAE, RMSE, ASE

### 4.2. Decision Trees

4.2.1. Target Variables: Known, Categorical OR Continuous

4.2.2. Goal: Classify an outcome by splitting records based on a set of predictors, resulting in a set of decision rules

4.2.3. Disadvantage of DT: Instability and poor predictive performance

4.2.4. Can be split on categorical or numeric data (splits on numeric data are not sensitive to skew or outliers)

4.2.5. Splitting Criteria: These measures are used to evaluate split purity

4.2.5.1. Gini: one minus the sum of squares of the class proportions; ranges from 0 (all items alike) to (k-1)/k (classes evenly mixed)

4.2.5.2. Entropy reduction or information gain: low entropy values are good (purer nodes), so a larger entropy reduction (higher information gain) indicates a better split

4.2.5.3. Chi-Square: Compares actual versus expected within a cell; higher values mean that variation is more significant and not due merely to chance

4.2.5.4. Variance Reduction (F-Test): For a numeric target variable, a good split reduces the variance of the target variable; a large F-test value means the proposed split has successfully divided the population into subpopulations with significantly different distributions
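To make the purity measures in 4.2.5 concrete, the sketch below computes Gini and entropy for a node and the information gain of a split. It assumes the common impurity form of Gini (one minus the sum of squared class proportions) and base-2 logarithms for entropy; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Share of records in each class at a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini_impurity(labels):
    """1 - sum(p^2): 0 for a pure node, (k-1)/k for k evenly mixed classes."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """-sum(p * log2 p): 0 for a pure node, higher for more mixed nodes."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Made-up nodes: one pure, one evenly mixed between two classes (k = 2).
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

gini_pure = gini_impurity(pure)      # lower bound of the range
gini_mixed = gini_impurity(mixed)    # upper bound (k-1)/k for k = 2
gain = information_gain(mixed, [["yes", "yes"], ["no", "no"]])
```

A split that sends every "yes" to one child and every "no" to the other removes all of the parent's entropy, which is the largest gain available for this node.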

### 4.3. Logistic Regression

4.3.1. Target Variables: Known, Categorical

4.3.2. Goal: Similar to multiple regression, but with a categorical target variable. Predict target values in other data where we have predictor values, but not target values

4.3.3. Variable Selection: Stepwise, forward, backward

4.3.4. Supervised

4.3.5. Performance Evaluation: Sensitivity, specificity, false positives, false negatives

### 4.4. Linear Regression

4.4.1. Target Variables: Known, Continuous

4.4.2. Goal: Predict target values in other data where we have predictor values, but not target values

4.4.3. Variable Selection: Stepwise, forward, backward

4.4.4. Supervised

4.4.5. Performance Evaluation: MAE, RMSE, MSE

### 4.5. KNN

4.5.1. Supervised

4.5.2. Target Variables: Categorical OR Continuous

4.5.3. Select k-nearest records (shortest distance) to determine the “best” model

4.5.4. Goal (categorical target): Classify a target variable based on the classification of the k records which are closest to the current record based on the attributes of the records (majority rule)

Goal (continuous target): Make a prediction based on the average response of the k records which are closest to the current record based on the attributes of the records

4.5.5. Things to consider: Predictors should be standardized or normalized before being used in a KNN model
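The two KNN goals and the standardization advice above can be illustrated together. The sketch below is an assumption-laden example rather than a definitive implementation: it uses z-score standardization, Euclidean distance, and invented age/income records; `knn_predict` and its `classify` flag are hypothetical names.

```python
import numpy as np
from collections import Counter

def standardize(train, other):
    """Z-score both arrays using the training mean and standard deviation."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant predictors
    return (train - mean) / std, (other - mean) / std

def knn_predict(X_train, y_train, x_new, k=3, classify=True):
    """Majority vote (classify=True) or average response of the k nearest records."""
    X_std, x_std = standardize(X_train, x_new[None, :])
    dists = np.linalg.norm(X_std - x_std[0], axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest records
    if classify:
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    return float(np.mean(y_train[nearest]))

# Invented records: [age, income]. Income would dominate the distance
# if the predictors were left on their original scales.
X_train = np.array([[25, 30000], [30, 32000], [28, 31000],
                    [55, 90000], [60, 95000], [58, 88000]], dtype=float)
y_class = np.array(["no", "no", "no", "yes", "yes", "yes"])
y_value = np.array([1.0, 1.2, 1.1, 5.0, 5.5, 5.2])
x_new = np.array([57.0, 91000.0])

predicted_class = knn_predict(X_train, y_class, x_new, k=3)
predicted_value = knn_predict(X_train, y_value, x_new, k=3, classify=False)
```

The new record is scaled with the training mean and standard deviation so that it is measured on the same scale as the records it is compared against.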

## 5. Model Evaluation

### 5.1. Continuous Models

5.1.1. MAE: Mean absolute error

5.1.2. MSE: Mean square error

5.1.3. RMSE: Root Mean Square Error

5.1.4. R-Square (Adjusted R-Square)

5.1.5. Other Continuous Model Evaluations

5.1.5.1. AIC (Akaike's Information Criterion)

5.1.5.2. SBC (Schwarz's Bayesian Criterion)
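The measures in 5.1.1 to 5.1.4 follow directly from the prediction errors. The sketch below shows one way to compute them with NumPy; the function name, the `n_predictors` argument, and the four sample values are assumptions for illustration. AIC and SBC are omitted because they depend on the model's likelihood.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """MAE, MSE, RMSE, R-squared and adjusted R-squared for a continuous target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = len(y_true)
    mae = np.mean(np.abs(errors))            # mean absolute error
    mse = np.mean(errors ** 2)               # mean square error
    rmse = np.sqrt(mse)                      # root mean square error
    ss_res = np.sum(errors ** 2)             # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalizes each additional predictor.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "AdjR2": adj_r2}

# Made-up actual and predicted values from a one-predictor model.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_predictors=1)
```

MAE and RMSE are expressed in the units of the target variable, which makes them easier to interpret than MSE when comparing models on the same data.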

### 5.2. Categorical Models

5.2.1. Classification/Confusion/Coincidence Matrix

5.2.2. Lift/Gain/ROC Charts

5.2.3. Misclassification Rate
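For a binary target, the confusion matrix and the rates derived from it can be computed by simple counting. The sketch below is illustrative only: the function names and the eight actual/predicted labels are invented, and it assumes both classes appear in the actual values so that no denominator is zero.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for a binary target."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

def classification_summary(actual, predicted, positive=1):
    """Misclassification rate, sensitivity and specificity from the counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
    }

# Made-up labels: 1 is the positive class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
summary = classification_summary(actual, predicted, positive=1)
```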

## 6. Feature Engineering

### 6.1. Creating new features from existing predictors to enhance prediction.

6.1.1. Cluster Segment

6.1.2. Principal Component

6.1.2.1. Create combinations of variables that can be used to reduce the number of predictors in a model

6.1.2.2. Goal : Create linear combinations of variables to reduce the number of predictors required
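As a rough illustration of how principal components form linear combinations of the original predictors, the sketch below derives them from a singular value decomposition of the centered data. The function name and the three correlated toy predictors are assumptions; the predictors are centered but not rescaled here, which is a simplification.

```python
import numpy as np

def principal_components(X, n_components):
    """Project records onto the first n_components principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the weights of each linear combination of predictors.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T        # new, reduced predictors
    explained = (S ** 2) / np.sum(S ** 2)     # share of variance per component
    return scores, components, explained[:n_components]

# Three made-up predictors that move almost together.
X = np.array([[1.0, 2.1, 0.9],
              [2.0, 3.9, 2.1],
              [3.0, 6.2, 2.9],
              [4.0, 7.8, 4.2],
              [5.0, 10.1, 5.0]])
scores, components, explained = principal_components(X, n_components=1)
```

Here the single column in `scores` could stand in for the three original predictors, since the first component carries nearly all of their variation.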

6.1.3. Transformations

6.1.3.1. Interval

6.1.3.1.1. Normalize

6.1.3.1.2. Standardize
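The two interval transformations can be sketched as follows. The definitions are assumptions, since the terms are used loosely in practice: here "normalize" means min-max rescaling to the range 0 to 1, and "standardize" means converting to z-scores. The sample values are invented.

```python
import numpy as np

def normalize(x):
    """Min-max rescaling: smallest value maps to 0, largest to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Z-scores: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Made-up interval variable (e.g. income in thousands).
income = [20, 30, 40, 50, 60]
income_normalized = normalize(income)
income_standardized = standardize(income)
```

Both functions assume the variable is not constant; a constant column would lead to division by zero and carries no information for a model anyway.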

6.1.3.2. Nominal

6.1.3.2.1. Nominal Data

6.1.3.2.2. Ordinal Data

6.1.3.2.3. Any Categorical Data

6.1.4. Ensembling

6.1.4.1. Combining models to achieve better outcomes

6.1.4.1.1. Bagging

6.1.4.1.2. Boosting

6.1.4.1.3. Stacked Modeling
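Of the three ensembling approaches, bagging is the simplest to sketch: fit the same kind of model on many bootstrap samples and average the predictions. The example below is only an illustration under stated assumptions: the base model is a straight-line fit via `np.polyfit`, and the function name, `n_models`, `seed`, and the data are hypothetical.

```python
import numpy as np

def bagged_line_predict(x, y, x_new, n_models=25, seed=0):
    """Average the predictions of straight-line fits on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw n record indices with replacement.
        idx = rng.integers(0, n, size=n)
        if len(np.unique(x[idx])) < 2:
            continue  # a line cannot be fitted to a single distinct x value
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        predictions.append(slope * x_new + intercept)
    # Combine the individual models by averaging their predictions.
    return float(np.mean(predictions))

# Made-up data scattered around the line y = 2x + 1.
x = np.arange(8.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05])
y = 2.0 * x + 1.0 + noise
bagged_prediction = bagged_line_predict(x, y, x_new=10.0)
```

Bagging tends to help most with unstable base models such as decision trees; a straight-line fit is used here only to keep the example short.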