STEP -1

Feature Reduction - defined as reducing the number of features used for classification.

Before we proceed with classification, we may need feature reduction.

STEP -2

Factor Analysis (FA) - not to be considered a feature or dimension reduction technique.

Quoting Prof. Mitra, IIT Kanpur (source: http://textofvideo.nptel.iitm.ac.in/111104024/lec38.pdf):

"FA explains the covariance structure or the variance covariance structure, of a random vector in terms of a few underlying unobservable factors."

According to the Wikipedia passage quoted below, Exploratory Factor Analysis (EFA) is a better option than PCA:

" Clearly though, PCA is a more basic version of exploratory factor analysis (EFA) that was developed in the early days prior to the advent of high-speed computers. From the point of view of exploratory analysis, the eigenvalues of PCA are inflated component loadings, i.e., contaminated with error variance"

Source: https://en.wikipedia.org/wiki/Factor_analysis

EFA and FA - not done yet for this data set.
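As a rough illustration of the quoted definition (a few unobservable factors explaining the covariance of the observed variables), here is a minimal sketch using scikit-learn's `FactorAnalysis`. The data is synthetic and generated from two hypothetical latent factors; it is not the HR data set, and the choice of two factors is an assumption made only for the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 6 observed variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.5 * rng.normal(size=(200, 6))

# Standardize, then fit a two-factor model.
X_std = StandardScaler().fit_transform(X)
fa = FactorAnalysis(n_components=2, random_state=0)
factor_scores = fa.fit_transform(X_std)

print(fa.components_.shape)      # estimated loadings: one row per factor
print(fa.noise_variance_.shape)  # per-variable noise variance not explained by the factors
```

The separate `noise_variance_` term is what distinguishes this model from PCA, which has no per-variable error component.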

STEP -3

Principal Component Analysis - PCA

Can't be done for this data set as-is, because most features are categorical.
Source: http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html#sphx-glr-auto-examples-plot-compare-reduction-py

If PCA were to be undertaken:

We cannot include the response variable, Attrition.
We can't include any other binomial or categorical variables.
Categorical variables can be included only with advanced methods.
Advanced methods for PCA: http://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-component-analysis-and-multidimensional/14013#14013
Advanced methods are out of scope for now. TBD: try again with "advanced methods" to include categorical variables.
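The constraints above can be sketched as follows: drop the response, keep only the numeric columns, standardize, then run PCA. The small frame below is a hypothetical stand-in with random values, not the HR data, and `n_components=2` is an arbitrary choice. Note that once categories have been recoded as integers they count as numeric, so they would have to be dropped by name instead.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in: three numeric features, one categorical feature, one response.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 60, size=100),
    "DailyRate": rng.integers(100, 1500, size=100),
    "DistanceFromHome": rng.integers(1, 30, size=100),
    "Department": rng.choice(["Sales", "Human Resources"], size=100),
    "Attrition": rng.choice(["Yes", "No"], size=100),
})

# Drop the response explicitly, then keep numeric columns only.
features = demo.drop(columns=["Attrition"])
numeric = features.select_dtypes(include="number")

# Standardize first so that no feature dominates because of its scale.
scaled = StandardScaler().fit_transform(numeric)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
```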

STEP -4

Pre Processing Data

Standardize variables:

"Democracy amongst variables": let's ensure all features have mean = 0 and variance = 1.

Multiple options for scaling and standardization with scikit-learn:

Option 1: sklearn.preprocessing.scale - Done. Type: function.
Option 2: sklearn.preprocessing.StandardScaler - Done. Type: utility class.
Option 3: MinMaxScaler - Not required with this data set.
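A minimal sketch of the difference between the first two options. The numbers are the Age and DailyRate values from the first five rows of the data shown further down; treating the first four rows as "train" and the fifth as "test" is an assumption made only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Age and DailyRate for the first five employees (4 train rows, 1 test row).
X_train = np.array([[41.0, 1102.0], [49.0, 279.0], [37.0, 1373.0], [33.0, 1392.0]])
X_test = np.array([[27.0, 591.0]])

# Option 1: a function. It standardizes the array it is given and keeps no state.
X_scaled = scale(X_train)

# Option 2: a class. It learns mean and standard deviation on the training data
# and can re-apply exactly the same transformation to new data.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```

On the training data both options give the same result; the class is the one that fits into a train/test workflow, because the test rows are scaled with the training statistics.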

Dataset train and test split: k-fold cross-validation with StratifiedShuffleSplit - Done
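A small sketch of the split, assuming it refers to scikit-learn's `StratifiedShuffleSplit`. The toy arrays, the 20% positive rate, `n_splits=5` and `test_size=0.25` are all arbitrary illustration values. Strictly speaking this class draws repeated random stratified splits rather than disjoint folds, so test sets from different iterations may overlap.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumption): 20 samples, 4 of them in the positive class.
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 4 + [0] * 16)

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

test_positive_counts = []
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Each test set keeps the class ratio of the full data as closely as possible.
    test_positive_counts.append(int(y_test.sum()))

print(test_positive_counts)
```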

STEP -5

Choosing Classifiers :-

Logistic Regression - Done
kNN (k-Nearest Neighbours) - Done [highest accuracy scores as of now].
Naive Bayes - Done but rejected, as it is not a good choice for this data set.
Neural Network MLP (Multi-Layer Perceptron) - TBD
Support Vector Machine - TBD
TPOT and other "Related Projects": http://scikit-learn.org/stable/related_projects.html#related-projects
Pipeline the classifiers - discover other options to automate with Pipeline
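One way the "pipeline the classifiers" idea could look: each candidate is wrapped in a pipeline with a scaler, so the scaling is refitted on the training part of every cross-validation split. This is a sketch on synthetic data from `make_classification`; the class weights, `n_neighbors=5` and `cv=5` are arbitrary assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced two-class data (assumption, not the HR data).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, clf in candidates.items():
    # Scaler and classifier are fitted together inside each CV split.
    pipe = make_pipeline(StandardScaler(), clf)
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    print(name, round(scores[name], 3))
```

Adding another classifier from the list (for example an SVM or MLP) would only need one more entry in `candidates`.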

STEP -6

Model Evaluation

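Need to ensure that the classification accuracy of the model on any test data set is greater than the proportion of the majority class in the sample or population [all train + all test sets]. Otherwise the model does no better than always predicting the majority class.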
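A sketch of how that reference accuracy could be computed, using scikit-learn's `DummyClassifier` with the `most_frequent` strategy. The labels below are made up (80% / 20% split) and are not the class ratio of the HR data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels (assumption): 80% of class 0, 20% of class 1.
y_train = np.array([0] * 80 + [1] * 20)
y_test = np.array([0] * 16 + [1] * 4)

# The baseline ignores the features, so placeholder columns are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Always predicting class 0 gets 16 of the 20 test labels right.
print(baseline_acc)
```

Any real classifier's test accuracy would then be compared against `baseline_acc` rather than against 0.5.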

STEP -7 - TBD

Data Visualization

Plot ROC curves, report AUC, etc. Look at own code from R stats and earlier Python notebooks to incorporate data visualization.
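A possible starting point for the ROC plot, sketched on synthetic data with a logistic regression model. The data set, the 75/25 split and the model settings are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption, not the HR data).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC needs scores, not hard labels: use the predicted probability of class 1.
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
```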

STEP -8

Look at Excel worksheets and R parallels for this project, using the same sample data set. Compare performance in terms of accuracy, time, etc.

STEP -9

Further investigation - Survival Analysis :- predicting when an employee is most likely to Churn or Exit.

Data source: Employee Attrition data from the Watson sample data sets

https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/

Reference:

http://datascience.stackexchange.com/questions/6715/is-it-necessary-to-standardize-your-data-before-clustering

In [3]:

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import scipy as sp
from sklearn import mixture
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

%matplotlib inline

In [83]:

# Pre-process data: read the tab-separated HR file (sep='\t').
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

df = pd.read_csv('hr.tsv', sep='\t')
df.head(5)

Out[83]:

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	3	...	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	4	...	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	5	...	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

In [84]:

mymap = {'Yes':1,'No':0,'Travel_Rarely':1, 'Travel_Frequently': 2 ,'Non-Travel':3, 'Research & Development' :1 , 
         'Human Resources':2,'Sales':3,'Life Sciences':1,'Medical':6,'Technical Degree':3,'Marketing':4,'Other':5,
        'Female':1, 'Male':2,'Research Scientist':1,'Laboratory Technician':2,'Healthcare Representative':3,
         'Manufacturing Director':4,'Manager':5,'Sales Representative':6,'Research Director':7,'Sales Executive':8,
        'Single':1,'Married':2,'Divorced':3}  # Medical = 6 because 2 is already used by 'Human Resources', which also appears in another column

# Replace every cell whose value is a key in mymap; leave all other cells unchanged.
# In mymap, Yes = 1 and No = 0, so the replacement is made in both Attrition and OverTime.
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
dfh = df.applymap(lambda s: mymap.get(s, s))
dfh.head(5)

# dfh.to_csv('dfh_05DEC.csv')  # optional: save the encoded frame as CSV

Out[84]:

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	1	1	1102	3	1	2	1	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	0	2	279	1	8	1	1	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	1	1	1373	1	2	2	5	1	3	...	2	80	0	7	3	3	0	0	0	0
3	33	0	2	1392	1	3	4	1	1	4	...	3	80	0	8	3	3	8	7	3	0
4	27	0	1	591	1	2	1	6	1	5	...	4	80	1	6	3	3	2	2	2	2
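The single shared dictionary forces codes to be unique across all columns (hence Medical = 6). A possible alternative, sketched here, keeps one mapping per column and applies it with `Series.map`. The frame holds the first five rows of three columns from the data above; the `column_maps` name and the per-column layout are assumptions for illustration, and the codes are the ones used in `mymap`.

```python
import pandas as pd

# First five rows of three categorical columns from the HR data.
demo = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "Department": ["Sales", "Research & Development", "Research & Development",
                   "Research & Development", "Research & Development"],
    "EducationField": ["Life Sciences", "Life Sciences", "Other",
                       "Life Sciences", "Medical"],
})

# One dictionary per column (hypothetical layout), so codes in different
# columns no longer have to avoid each other.
column_maps = {
    "Attrition": {"Yes": 1, "No": 0},
    "Department": {"Research & Development": 1, "Human Resources": 2, "Sales": 3},
    "EducationField": {"Life Sciences": 1, "Other": 5, "Medical": 6},
}

encoded = demo.copy()
for col, mapping in column_maps.items():
    # Values missing from the mapping become NaN, which makes gaps easy to spot.
    encoded[col] = demo[col].map(mapping)

print(encoded)
```

Either way, integer codes put an artificial order on nominal categories, which matters for distance-based methods such as kNN.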