Friday 2 December 2016

Employee Attrition / Churn Prediction - IBM Watson Analytics Sample Data Set - kNN, Naive Bayes, Neural Network, Support Vector Machine - Python 2.7, scikit-learn

STEP -1

Feature Reduction - defined as reducing the number of Features used for Classification.

Before we proceed with Classification - we "may need" - Feature Reduction.

STEP -2

Factor Analysis - Factor Analysis is not to be considered as a Feature or Dimension Reduction technique.

Quoting Prof Mitra IIT Kanpur - Source - http://textofvideo.nptel.iitm.ac.in/111104024/lec38.pdf

"FA explains the covariance structure or the variance covariance structure, of a random vector in terms of a few underlying unobservable factors."

According to Wiki quoted below - Exploratory Factor Analysis is the better option - compared to PCA

" Clearly though, PCA is a more basic version of exploratory factor analysis (EFA) that was developed in the early days prior to the advent of high-speed computers. From the point of view of exploratory analysis, the eigenvalues of PCA are inflated component loadings, i.e., contaminated with error variance"

Source --- https://en.wikipedia.org/wiki/Factor_analysis

EFA, FA - Not done yet for this Data Set.

STEP -3

Principal Component Analysis - PCA

Not applied to this Data Set, as most Features are Categorical and PCA assumes continuous numeric inputs.
If PCA were to be undertaken - Source :- http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html#sphx-glr-auto-examples-plot-compare-reduction-py

STEP -4

Pre Processing Data

Standardize Variables :-

  • "Democracy amongst Variables" - let's ensure all Features have Mean = 0 and Variance = 1

Multiple options for SCALING and STANDARDIZATION with scikit-learn

  • Option -1 sklearn.preprocessing.scale - Done. Type= Function.
  • Option -2 sklearn.preprocessing.StandardScaler -Done. Type= Utility Class.
  • Option -3 MinMax Scaler -Not Required with this DataSet.
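
A minimal sketch of how the three options relate, assuming a small made-up array `X_demo` (not the HR data). On the same array, `scale` and `StandardScaler` should give the same result, while `MinMaxScaler` maps each column to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler

# Made-up data for illustration only: 4 samples, 2 features on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0],
                   [4.0, 800.0]])

# Option 1: function, standardizes the array it is given.
X_fn = scale(X_demo)

# Option 2: class, learns mean/std in fit() and applies them in transform().
X_cls = StandardScaler().fit_transform(X_demo)

# Option 3: rescales each column to the range [0, 1].
X_mm = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(X_fn, X_cls))   # do both routes agree on the same array?
print(X_fn.mean(axis=0))          # per-column means after standardizing
print(X_fn.std(axis=0))           # per-column standard deviations
print(X_mm.min(axis=0))           # per-column minimum after min-max scaling
print(X_mm.max(axis=0))           # per-column maximum after min-max scaling
```

The class form matters once there is a separate test set, because the fitted object can re-apply the training statistics to new data.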

Dataset Train & Test Split -- Cross Validation with StratifiedShuffleSplit - Done

STEP -5

Choosing Classifiers :-

  • Logistic Regression - Done
  • kNN - k Nearest Neighbour - Done [Highest Accuracy scores as of NOW].
  • Naive Bayes - Done but rejected - as not a good choice for this Dataset.
  • Neural Network MLP - Multi Layer Perceptron - TBD
  • Support Vector Machine - TBD
  • TPOT and other "Related Projects" -- http://scikit-learn.org/stable/related_projects.html#related-projects
  • Pipeline the Classifiers - discover other options to automate with Pipeline

STEP -6

Model Evaluation

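Need to ensure the CLASSIFICATION ACCURACY displayed by the Model on any Test data set is greater than the share of the Dominant Class in the Sample or Population [All Train + All Test sets] - a Model that always predicts the Dominant Class already reaches that Accuracy.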
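
A small sketch of that baseline, using the class counts printed by `value_counts()` later in this post (2466 and 474). `DummyClassifier` is scikit-learn's majority-class predictor; the array names here are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output shown later in this post.
n_stay, n_exit = 2466, 474
y_all = np.array([0] * n_stay + [1] * n_exit)
X_dummy = np.zeros((len(y_all), 1))   # features are ignored by this baseline

# Always predicts the most frequent class (0 = live employee).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_all)
baseline_acc = baseline.score(X_dummy, y_all)
print("Baseline accuracy: {:.4f}".format(baseline_acc))
```

Any classifier in Step 5 is only useful if it beats this number on held-out data.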

STEP -7 - TBD

Data Visualization

Plot ROC Curves and report AUC etc - look at own code from R Stats and earlier Python Notebooks to incorporate data viz.
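
As a rough sketch of what this step could look like, the block below computes ROC points and AUC with scikit-learn. It assumes synthetic data from `make_classification` and a `LogisticRegression` model as stand-ins; none of the numbers it prints relate to the HR data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the HR features (assumption).
X_syn, y_syn = make_classification(n_samples=600, n_features=10,
                                   weights=[0.84, 0.16], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC needs scores or probabilities for the positive class, not hard labels.
proba_exit = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba_exit)
auc_value = roc_auc_score(y_te, proba_exit)
print("AUC = {:.3f}".format(auc_value))

# To draw the curve with matplotlib:
#   plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], "--")
```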

STEP -8

Look at Excel worksheets and R Parallels for this Project - using same Sample Data Set. Compare Performance as per Accuracy and Time etc.

STEP -9

Further investigation - Survival Analysis :- predicting when an employee is most likely to Churn or Exit.

Data Source Employee Attrition Data == WATSON Sample Data Sets

Reference :-

In [3]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import scipy as sp
from sklearn import mixture
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

%matplotlib inline
In [83]:
# Pre Process- Data 

df=pd.read_csv('hr.tsv',sep='\t')
df.head(5)

# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
# Read TSV with \t 
Out[83]:
   Age  Attrition  BusinessTravel     DailyRate  Department
0  41   Yes        Travel_Rarely      1102       Sales
1  49   No         Travel_Frequently  279        Research & Development
2  37   Yes        Travel_Rarely      1373       Research & Development
3  33   No         Travel_Frequently  1392       Research & Development
4  27   No         Travel_Rarely      591        Research & Development

   DistanceFromHome  Education  EducationField  EmployeeCount  EmployeeNumber  ...
0  1                 2          Life Sciences   1              1               ...
1  8                 1          Life Sciences   1              2               ...
2  2                 2          Other           1              3               ...
3  3                 4          Life Sciences   1              4               ...
4  2                 1          Medical         1              5               ...

   RelationshipSatisfaction  StandardHours  StockOptionLevel  TotalWorkingYears  TrainingTimesLastYear
0  1                         80             0                 8                  0
1  4                         80             1                 10                 3
2  2                         80             0                 7                  3
3  3                         80             0                 8                  3
4  4                         80             1                 6                  3

   WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
0  1                6               4                   0                        5
1  3                10              7                   1                        7
2  3                0               0                   0                        0
3  3                8               7                   3                        0
4  3                2               2                   2                        2
5 rows × 35 columns
In [84]:
mymap = {'Yes':1,'No':0,'Travel_Rarely':1, 'Travel_Frequently': 2 ,'Non-Travel':3, 'Research & Development' :1 , 
         'Human Resources':2,'Sales':3,'Life Sciences':1,'Medical':6,'Technical Degree':3,'Marketing':4,'Other':5,
        'Female':1, 'Male':2,'Research Scientist':1,'Laboratory Technician':2,'Healthcare Representative':3,
         'Manufacturing Director':4,'Manager':5,'Sales Representative':6,'Research Director':7,'Sales Executive':8,
        'Single':1,'Married':2,'Divorced':3}#Medical = 6 as HR =2 in another column

#
dfh =df.applymap(lambda s: mymap.get(s) if s in mymap else s)
#
dfh.head(5)
#
# In mymap == Yes =1 and No =0 - replacements made in both Attrition and OverTime
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html

#dfh.to_csv('dfh_05DEC.csv') # Ok for down csv 
Out[84]:
   Age  Attrition  BusinessTravel  DailyRate  Department
0  41   1          1               1102       3
1  49   0          2               279        1
2  37   1          1               1373       1
3  33   0          2               1392       1
4  27   0          1               591        1

   DistanceFromHome  Education  EducationField  EmployeeCount  EmployeeNumber  ...
0  1                 2          1               1              1               ...
1  8                 1          1               1              2               ...
2  2                 2          5               1              3               ...
3  3                 4          1               1              4               ...
4  2                 1          6               1              5               ...

   RelationshipSatisfaction  StandardHours  StockOptionLevel  TotalWorkingYears  TrainingTimesLastYear
0  1                         80             0                 8                  0
1  4                         80             1                 10                 3
2  2                         80             0                 7                  3
3  3                         80             0                 8                  3
4  4                         80             1                 6                  3

   WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
0  1                6               4                   0                        5
1  3                10              7                   1                        7
2  3                0               0                   0                        0
3  3                8               7                   3                        0
4  3                2               2                   2                        2
5 rows × 35 columns
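
Because `mymap` is one dictionary applied to every cell, a label that occurs in several columns (such as 'Human Resources') has to share a single code, which is why 'Medical' ended up as 6. One possible alternative, sketched here on a tiny made-up frame with illustrative codes, is one mapping per column via `Series.map`, or one-hot encoding via `pd.get_dummies` for columns with no natural order.

```python
import pandas as pd

# Tiny made-up frame using a few of the dataset's column names.
demo = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "Department": ["Sales", "Human Resources", "Research & Development"],
    "EducationField": ["Medical", "Human Resources", "Life Sciences"],
})

# One dict per column, so the same label can be coded independently.
# The codes themselves are illustrative assumptions.
column_maps = {
    "Attrition": {"Yes": 1, "No": 0},
    "Department": {"Research & Development": 1, "Human Resources": 2, "Sales": 3},
    "EducationField": {"Life Sciences": 1, "Human Resources": 2, "Medical": 3},
}

encoded = demo.copy()
for col, mapping in column_maps.items():
    encoded[col] = demo[col].map(mapping)

# Alternative for nominal columns: one-hot encoding avoids implying an order.
one_hot = pd.get_dummies(demo[["Department"]])

print(encoded)
print(one_hot.columns.tolist())
```

Integer codes impose an ordering (Sales = 3 is "greater than" Human Resources = 2) that distance-based models such as kNN will treat as meaningful, so the one-hot form may be the safer choice there.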
In [85]:
#InterimDF Dropped-Attr,EmployeeCount,EmployeeNumber ,Over18 and StandardHours
df1 = dfh.drop(df.columns[[1,8,9,21,26]],axis=1,inplace=False) 
print df1.shape
names = df1.columns.values
print "________________________________________"
print names
print "________________________________________"
df2 = pd.DataFrame(dfh["Attrition"]) #  Interim DF only - Attr
names1 = df2.columns.values
print names1
print df2.shape 
print df2["Attrition"].value_counts() # Here - 0 == Live Employee , 1 == Exited Employee / Attrited Employee 
#
(2940, 30)
________________________________________
['Age' 'BusinessTravel' 'DailyRate' 'Department' 'DistanceFromHome'
 'Education' 'EducationField' 'EnvironmentSatisfaction' 'Gender'
 'HourlyRate' 'JobInvolvement' 'JobLevel' 'JobRole' 'JobSatisfaction'
 'MaritalStatus' 'MonthlyIncome' 'MonthlyRate' 'NumCompaniesWorked'
 'OverTime' 'PercentSalaryHike' 'PerformanceRating'
 'RelationshipSatisfaction' 'StockOptionLevel' 'TotalWorkingYears'
 'TrainingTimesLastYear' 'WorkLifeBalance' 'YearsAtCompany'
 'YearsInCurrentRole' 'YearsSinceLastPromotion' 'YearsWithCurrManager']
________________________________________
['Attrition']
(2940, 1)
0    2466
1     474
Name: Attrition, dtype: int64
In [86]:
# Convert DF to Numpy Array 
# 1st Numpy Array == X , only features 
# 2nd Numpy Array == y , only target Labels
import numpy as np

X = df1.iloc[:,0:30].values # All 30 Features of df1 (Age included, it is column 0); Attrition was dropped above
y = df2.iloc[:,0].values # Choosing only 1 - Target Feature from - dfh
#
print X.shape
print y.shape
#
print "_________________________________________________________"
print('Target Variable "Attrition":', (y))
print "_________________________________________________________"
print('Class labels for Target Variable "Attrition":', np.unique(y))
print "_________________________________________________________"
print('Percentage of Class Label ==1 = {:.4f}'.format(df2["Attrition"].mean()))
print('Percentage of Class Label ==0 = {:.4f}'.format(1-df2["Attrition"].mean()))
print "_________________________________________________________"
print "Model that Predicts 83.88% Accuracy is Non Predictor OR NO_Model- as it will always predict Dominant Class"
print "This dataset Dominant Class = ZERO or LIVE EMPLOYEE - we need more than 83.88% Accuracy Score."
(2940, 30)
(2940,)
_________________________________________________________
('Target Variable "Attrition":', array([1, 0, 1, ..., 0, 0, 0]))
_________________________________________________________
('Class labels for Target Variable "Attrition":', array([0, 1]))
_________________________________________________________
Percentage of Class Label ==1 = 0.1612
Percentage of Class Label ==0 = 0.8388
_________________________________________________________
Model that Predicts 83.88% Accuracy is Non Predictor OR NO_Model- as it will always predict Dominant Class
This dataset Dominant Class = ZERO or LIVE EMPLOYEE - we need more than 83.88% Accuracy Score.
In [87]:
'''
Source :-- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit
#          http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance
there could be several times more negative samples than positive samples. In such cases it is recommended to 
use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative 
class frequencies is approximately preserved in each train and validation fold.

'''

from sklearn.model_selection import StratifiedShuffleSplit
#
print "__________________________________"

%time sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=123) # test_size=0.3 thus TRAIN_size =0.7 OR 70% 
sss.get_n_splits(X, y)

print "__________________________________"
print(sss)      
print "__________________________________"

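# Note - each pass of this loop overwrites the previous split, so only the last of the 10 splits
# is kept in X_train, X_test, y_train and y_test and used by the cells below.
for train_index, test_index in sss.split(X, y):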
#    print("TRAIN:", train_index, "TEST:", test_index) # Printing INDEX Values not ACTUAL Feature Values 
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
#
print X_train.shape
print y_train.shape
print X_test.shape
print y_test.shape
#
np.savetxt("X_train.csv", X_train, delimiter=",") # Numpy arrays saved as CSV's 
np.savetxt("y_train.csv", y_train, delimiter=",")
np.savetxt("X_test.csv", X_test, delimiter=",")
np.savetxt("y_test.csv", y_test, delimiter=",")
__________________________________
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 59.8 µs
__________________________________
StratifiedShuffleSplit(n_splits=10, random_state=123, test_size=0.3,
            train_size=None)
__________________________________
(2058, 30)
(2058,)
(882, 30)
(882,)
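
The loop above keeps only the last of the ten splits. If the aim is to score a model on every split, one option is to hand the splitter to `cross_val_score` through its `cv` argument. The sketch below assumes synthetic data and `GaussianNB` as a placeholder model; the variable names are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X and y (assumption), with a similar class imbalance.
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   weights=[0.84, 0.16], random_state=0)

sss_demo = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=123)

# Passing the splitter as cv= fits and scores the model once per split.
split_scores = cross_val_score(GaussianNB(), X_syn, y_syn,
                               cv=sss_demo, scoring="accuracy")
print(len(split_scores))
print("Accuracy: {:.3f} (+/- {:.3f})".format(split_scores.mean(),
                                              split_scores.std() * 2))

# Stratification: every test fold keeps roughly the same minority-class share.
shares = [y_syn[test_idx].mean() for _, test_idx in sss_demo.split(X_syn, y_syn)]
print(["{:.3f}".format(s) for s in shares])
```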
In [88]:
# Source --  http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes
# Gaussian Naive bayes -GaussianNB as 1st Classifier without any Feature Scaling 

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_train)
print("Number of mislabeled points out of a total %d points : %d" % (X_train.shape[0],(y_train != y_pred).sum()))


# Cells below use Scaler and STD Scaler data - these do not give the best ACCURACY scores,
# but classifying Non Scaled Data is also questionable as the Categorical Features all have Diff Scales.
# Thus with this data set Naive Bayes is not a Good Choice of Classifier.
# Source - http://stackoverflow.com/questions/34725726/is-it-possible-apply-pca-on-any-text-classification
# Source - http://stackoverflow.com/questions/16123572/k-fold-cross-validation-for-naive-bayes-classifier
Number of mislabeled points out of a total 2058 points : 386
In [89]:
# 1st RUN - Multinomial Naive Bayes - with X_train_scaled and y_train 
# 2nd RUN - Multinomial Naive Bayes - with X_train_sc and y_train 
#

# Multinomial Naive Bayes is fitted on the raw X_train only - MNB cant be used with Negative Values,
# so it throws an error with scaled data
#
'''
# When we use X_train_scaled or X_train_sc
# We seem to be getting a Sparse Matrix or negative Value Error at location --- 
/home/dhankar/anaconda2/lib/python2.7/site-packages/sklearn/naive_bayes.pyc in _count(self, X, Y)
    688         """Count and smooth feature occurrences."""
    689         if np.any((X.data if issparse(X) else X) < 0):
--> 690             raise ValueError("Input X must be non-negative")
    691         self.feature_count_ += safe_sparse_dot(Y.T, X)
    692         self.class_count_ += Y.sum(axis=0)

ValueError: Input X must be non-negative


'''
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# train the model using the raw X_train (timing it with an IPython "magic command")
%time nb.fit(X_train, y_train)

# Source - http://stackoverflow.com/questions/34725726/is-it-possible-apply-pca-on-any-text-classification
# Source - http://stackoverflow.com/questions/16123572/k-fold-cross-validation-for-naive-bayes-classifier
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 3.98 ms
Out[89]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
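
The "Input X must be non-negative" error follows from standardization centring each feature on zero. A possible workaround, sketched below on made-up data, is to rescale to [0, 1] with `MinMaxScaler` before `MultinomialNB`. This only removes the error: MultinomialNB is designed for count-like features, so whether it suits this data set is a separate question.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
X_raw = rng.randint(0, 100, size=(60, 4)).astype(float)  # made-up features
y_lab = rng.randint(0, 2, size=60)                        # made-up labels

# Standardized data is centred on 0, so roughly half the values are negative.
X_std = StandardScaler().fit_transform(X_raw)
try:
    MultinomialNB().fit(X_std, y_lab)
    failed = False
except ValueError as err:
    failed = True
    print("MultinomialNB rejected standardized input: {}".format(err))

# MinMaxScaler keeps every value in [0, 1], which MultinomialNB accepts.
mnb_pipe = make_pipeline(MinMaxScaler(), MultinomialNB())
mnb_pipe.fit(X_raw, y_lab)
print(mnb_pipe.predict(X_raw[:5]))
```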
In [90]:
# All Code below this - OK 
#
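# How does STANDARD SCALER differ from this SCALER ? On the STANDARD SCALER official documentation page, scale
# is described as the "Equivalent function without the object oriented API." The practical difference: scale()
# standardizes whatever array it is given with that array's own mean and std, while StandardScaler stores the
# mean and std learned in fit() so the same transformation can be re-applied to the test set.
# In this notebook the data set pre-processed with STANDARD SCALER gave a higher Accuracy Score upon model evaluation.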
# Other Scalers given - required / not required ?? 
# MIN MAX Scaler Not Required - Not implemented for this data set. 
In [91]:
##### simple scale --- mean ==0 variance ==1

#
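# Docs for Scaler state dont Scale the Target Feature. For classification the class labels must stay as they are:
# MLP - Neural Net needs only the input features X scaled. y is scaled below for inspection only.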
#

## 1st RUN - Scale X_train ,X_test , y_train and y_test. 
#  2nd RUN - STANDARD SCALER X_train ,X_test , y_train and y_test. 


from sklearn import preprocessing
from sklearn.preprocessing import scale
#
X_train_scaled = preprocessing.scale(X_train)
print X_train_scaled.shape 
print type(X_train_scaled)
print "_________________________________"
y_train_scaled = preprocessing.scale(y_train)
print y_train_scaled.shape 
print type(y_train_scaled)
print "_________________________________"

#print X_train_scaled # Ok Not required 
print "_________________________________"
print X_train_scaled.mean(axis=0) # Means are about e-16 or e-17 rather than exact ZERO's because of floating-point round-off
print "_________________________________"
print X_train_scaled.std(axis=0)
print "_________________X-train-scaled___________________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,0].mean()))
print('Feature == 0 -- Variance after Rescaling = {:.8f}'.format(X_train_scaled[:,0].std()))
print('Feature == 1 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,1].mean()))
print('Feature == 1 -- Variance after Rescaling = {:.8f}'.format(X_train_scaled[:,1].std()))
print('Feature == 2 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,2].mean()))
print('Feature == 3 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,3].mean()))
print('Feature == 4 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,4].mean()))
print('Feature == 5 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,5].mean()))
print('Feature == 6 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,6].mean()))
print('Feature == 7 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,7].mean()))
print('Feature == 8 -- Mean after Rescaling = {:.8f}'.format(X_train_scaled[:,8].mean()))
#
print "_________________y-train-scaled______________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.8f}'.format(y_train_scaled.mean()))
print('Feature == 0 -- Variance after Rescaling = {:.8f}'.format(y_train_scaled.std()))

#
print "_____________________________________________________________________________________________"
#
X_test_scaled = preprocessing.scale(X_test)
print X_test_scaled.shape 
#print type(X_test_scaled)
print "_________________________________"
y_test_scaled = preprocessing.scale(y_test)
print y_test_scaled.shape 
#print type(y_test_scaled)
print "_________________________________"

#print X_test_scaled # Ok Not required 
print "_________________________________"
print X_test_scaled.mean(axis=0) # 
print "_________________________________"
print X_test_scaled.std(axis=0)
print "________________X-Test-Scaled_______________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.8f}'.format(X_test_scaled[:,0].mean()))
print('Feature == 0 -- Variance after Rescaling = {:.8f}'.format(X_test_scaled[:,0].std()))
print('Feature == 1 -- Mean after Rescaling = {:.8f}'.format(X_test_scaled[:,1].mean()))
print('Feature == 1 -- Variance after Rescaling = {:.8f}'.format(X_test_scaled[:,1].std()))
print('Feature == 2 -- Mean after Rescaling = {:.8f}'.format(X_test_scaled[:,2].mean()))
print('Feature == 3 -- Mean after Rescaling = {:.8f}'.format(X_test_scaled[:,3].mean()))
print('Feature == 4 -- Mean after Rescaling = {:.8f}'.format(X_test_scaled[:,4].mean()))
print('Feature == 5 -- Mean after Rescaling = {:.8f}'.format(X_test_scaled[:,5].mean()))
print "________________y-Test-Scaled_______________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.8f}'.format(y_test_scaled.mean()))
print('Feature == 0 -- Variance after Rescaling = {:.8f}'.format(y_test_scaled.std()))


# The -0.0000 values for Mean on Rescaling are tiny negative round-off values (about -1e-16) printed to
# 8 decimals; they do not impact the Predictions.
(2058, 30)
<type 'numpy.ndarray'>
_________________________________
(2058,)
<type 'numpy.ndarray'>
_________________________________
_________________________________
[  2.54196836e-16   1.38643011e-16  -1.38184464e-16  -2.71945300e-16
  -3.53890332e-17  -1.76750958e-15  -1.31629941e-17   4.57467991e-17
  -5.23282961e-18  -2.06400063e-16  -2.30082663e-16   3.45639859e-16
   3.05422597e-16  -1.41879813e-16  -1.05303953e-16  -1.44037681e-17
  -7.70358833e-17  -7.15872669e-17  -2.51931075e-16   3.88416219e-17
   3.56317934e-16  -1.06598673e-16  -8.05963653e-17   7.41767084e-17
  -1.02714511e-16   2.24849833e-16  -2.83220159e-17  -6.71636378e-17
  -2.10392118e-17  -3.11596122e-16]
_________________________________
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
_________________X-train-scaled___________________________________________________________________
Feature == 0 -- Mean after Rescaling = 0.00000000
Feature == 0 -- Variance after Rescaling = 1.00000000
Feature == 1 -- Mean after Rescaling = -0.00000000
Feature == 1 -- Variance after Rescaling = 1.00000000
Feature == 2 -- Mean after Rescaling = -0.00000000
Feature == 3 -- Mean after Rescaling = -0.00000000
Feature == 4 -- Mean after Rescaling = -0.00000000
Feature == 5 -- Mean after Rescaling = 0.00000000
Feature == 6 -- Mean after Rescaling = -0.00000000
Feature == 7 -- Mean after Rescaling = 0.00000000
Feature == 8 -- Mean after Rescaling = 0.00000000
_________________y-train-scaled______________________________________________________________
Feature == 0 -- Mean after Rescaling = -0.00000000
Feature == 0 -- Variance after Rescaling = 1.00000000
_____________________________________________________________________________________________
(882, 30)
_________________________________
(882,)
_________________________________
_________________________________
[  1.78491638e-16   2.30604148e-16   5.79027881e-17  -5.14579561e-16
   1.78869265e-16   1.06364904e-16  -5.43782706e-17   3.52451754e-18
  -5.28677631e-17  -5.92874200e-17   6.34413157e-17  -1.40477199e-16
  -1.86799430e-16   1.33176413e-16   1.67666334e-16  -7.27561120e-17
   5.55111512e-17   2.46716228e-17   1.62883061e-16  -1.78995141e-16
  -4.63474056e-16  -2.01652753e-16  -2.37558776e-16   2.55527522e-17
   1.56966906e-16   4.87138674e-17  -2.44198715e-17  -7.99310227e-17
   1.76225877e-17   1.96995355e-17]
_________________________________
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
________________X-Test-Scaled_______________________________________________________________
Feature == 0 -- Mean after Rescaling = 0.00000000
Feature == 0 -- Variance after Rescaling = 1.00000000
Feature == 1 -- Mean after Rescaling = -0.00000000
Feature == 1 -- Variance after Rescaling = 1.00000000
Feature == 2 -- Mean after Rescaling = 0.00000000
Feature == 3 -- Mean after Rescaling = -0.00000000
Feature == 4 -- Mean after Rescaling = 0.00000000
Feature == 5 -- Mean after Rescaling = 0.00000000
________________y-Test-Scaled_______________________________________________________________
Feature == 0 -- Mean after Rescaling = 0.00000000
Feature == 0 -- Variance after Rescaling = 1.00000000
In [92]:
# Standardizing and Rescaling - 

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)

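# Scaling the class labels is for inspection only. Passing the 1-D y_train triggers the DeprecationWarning
# shown in the output; per that warning, scikit-learn 0.19+ needs y_train.reshape(-1, 1) instead.
sc1 = StandardScaler()
sc1.fit(y_train)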

#sc = preprocessing.StandardScaler().fit(X_train) # single line option chained code 

X_train_sc = sc.transform(X_train)
X_test_sc = sc.transform(X_test) # Important - sc is not re-fitted on X_test; the test set is transformed with the mean/std learned from X_train
y_train_sc1 = sc1.transform(y_train)
y_test_sc1 = sc1.transform(y_test) 

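# The means and STD values for X_test_sc differ from those of X_test_scaled above: sc applies the mean and std
# learned from X_train, whereas scale() standardized X_test with its own statistics.
# Note - the lines labelled "Variance" below print .std(), the standard deviation.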

print X_train_sc.shape 
print type(X_train_sc)
print "_________________________________"
#print X_test_scaled # Ok Not required 
print "_________________________________"
print X_train_sc.mean(axis=0) # 
print "_________________________________"
print X_train_sc.std(axis=0)
print "________________X-Train-sc_______________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.4f}'.format(X_train_sc[:,0].mean()))
print('Feature == 0 -- Variance after Rescaling = {:.4f}'.format(X_train_sc[:,0].std()))
print('Feature == 1 -- Mean after Rescaling = {:.4f}'.format(X_train_sc[:,1].mean()))
print('Feature == 1 -- Variance after Rescaling = {:.4f}'.format(X_train_sc[:,1].std()))
print('Feature == 2 -- Mean after Rescaling = {:.8f}'.format(X_train_sc[:,2].mean()))
print('Feature == 3 -- Mean after Rescaling = {:.8f}'.format(X_train_sc[:,3].mean()))
print('Feature == 4 -- Mean after Rescaling = {:.8f}'.format(X_train_sc[:,4].mean()))
print('Feature == 5 -- Mean after Rescaling = {:.8f}'.format(X_train_sc[:,5].mean()))
print "________________y-Train-sc_______________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.4f}'.format(y_train_sc1.mean()))
print('Feature == 0 -- Variance after Rescaling = {:.4f}'.format(y_train_sc1.std()))

print X_test_sc.shape 
print type(X_test_sc)
print "_________________________________"
#print X_test_scaled # Ok Not required 
print "_________________________________"
print X_test_sc.mean(axis=0) # 
print "_________________________________"
print X_test_sc.std(axis=0)
print "_____________X-Test-sc__________________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.4f}'.format(X_test_sc[:,0].mean()))
print('Feature == 0 -- Variance after Rescaling = {:.4f}'.format(X_test_sc[:,0].std()))
print('Feature == 1 -- Mean after Rescaling = {:.4f}'.format(X_test_sc[:,1].mean()))
print('Feature == 1 -- Variance after Rescaling = {:.4f}'.format(X_test_sc[:,1].std()))
print('Feature == 2 -- Mean after Rescaling = {:.8f}'.format(X_test_sc[:,2].mean()))
print('Feature == 3 -- Mean after Rescaling = {:.8f}'.format(X_test_sc[:,3].mean()))
print('Feature == 4 -- Mean after Rescaling = {:.8f}'.format(X_test_sc[:,4].mean()))
print('Feature == 5 -- Mean after Rescaling = {:.8f}'.format(X_test_sc[:,5].mean()))
print "____________y-Test-sc___________________________________________________________________"
print('Feature == 0 -- Mean after Rescaling = {:.4f}'.format(y_test_sc1.mean()))
print('Feature == 0 -- Variance after Rescaling = {:.4f}'.format(y_test_sc1.std()))
(2058, 30)
<type 'numpy.ndarray'>
_________________________________
_________________________________
[  2.54196836e-16   1.38643011e-16  -1.38184464e-16  -2.71945300e-16
  -3.53890332e-17  -1.76750958e-15  -1.31629941e-17   4.57467991e-17
  -5.23282961e-18  -2.06400063e-16  -2.30082663e-16   3.45639859e-16
   3.05422597e-16  -1.41879813e-16  -1.05303953e-16  -1.44037681e-17
  -7.70358833e-17  -7.15872669e-17  -2.51931075e-16   3.88416219e-17
   3.56317934e-16  -1.06598673e-16  -8.05963653e-17   7.41767084e-17
  -1.02714511e-16   2.24849833e-16  -2.83220159e-17  -6.71636378e-17
  -2.10392118e-17  -3.11596122e-16]
_________________________________
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
________________X-Train-sc_______________________________________________________________
Feature == 0 -- Mean after Rescaling = 0.0000
Feature == 0 -- Variance after Rescaling = 1.0000
Feature == 1 -- Mean after Rescaling = -0.0000
Feature == 1 -- Variance after Rescaling = 1.0000
Feature == 2 -- Mean after Rescaling = -0.00000000
Feature == 3 -- Mean after Rescaling = -0.00000000
Feature == 4 -- Mean after Rescaling = -0.00000000
Feature == 5 -- Mean after Rescaling = 0.00000000
________________y-Train-sc_______________________________________________________________
Feature == 0 -- Mean after Rescaling = -0.0000
Feature == 0 -- Variance after Rescaling = 1.0000
(882, 30)
<type 'numpy.ndarray'>
_________________________________
_________________________________
[ 0.02284304  0.01171951 -0.03312591  0.03746291 -0.03754614  0.03429636
  0.01149452  0.00355814  0.01255395  0.03384754 -0.03335764  0.09437888
  0.07167473  0.04611772 -0.00484988  0.09476775 -0.05191783 -0.0246839
  0.0086405  -0.02135523  0.01533514  0.02791042  0.00721656  0.0742232
  0.0150133   0.00817128  0.03999379  0.01057196 -0.0022705   0.02927018]
_________________________________
[ 1.01520802  1.00919006  0.98397353  1.01967949  1.00059282  0.98093763
  1.00898393  1.00063203  0.9974045   1.0128535   0.9653119   1.04236777
  1.0045065   0.99827177  0.97780374  1.04620163  0.94088142  0.96962445
  1.00414428  1.03012894  1.01463236  0.96833058  0.99571242  1.0171314
  0.98496728  0.96521821  1.03084062  0.97746002  0.9375137   1.00276723]
_____________X-Test-sc__________________________________________________________________
Feature == 0 -- Mean after Rescaling = 0.0228
Feature == 0 -- Variance after Rescaling = 1.0152
Feature == 1 -- Mean after Rescaling = 0.0117
Feature == 1 -- Variance after Rescaling = 1.0092
Feature == 2 -- Mean after Rescaling = -0.03312591
Feature == 3 -- Mean after Rescaling = 0.03746291
Feature == 4 -- Mean after Rescaling = -0.03754614
Feature == 5 -- Mean after Rescaling = 0.03429636
____________y-Test-sc___________________________________________________________________
Feature == 0 -- Mean after Rescaling = -0.0009
Feature == 0 -- Variance after Rescaling = 0.9992
/home/dhankar/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/home/dhankar/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/home/dhankar/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
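
The non-zero test-set means above are expected rather than a fault. A short sketch of the difference, on made-up normally distributed arrays (all names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import scale, StandardScaler

rng = np.random.RandomState(123)
X_tr_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # made-up train set
X_te_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 3))   # made-up test set

# scale() standardizes each array with that array's own mean and std,
# so the test set ends up with mean 0 and std 1 by construction.
X_te_own = scale(X_te_demo)

# StandardScaler learns mean/std from the training set only and reuses them,
# so the transformed test set is close to, but not exactly, mean 0 / std 1.
sc_demo = StandardScaler().fit(X_tr_demo)
X_te_train_stats = sc_demo.transform(X_te_demo)

print(X_te_own.mean(axis=0))          # effectively zero (floating-point round-off)
print(X_te_train_stats.mean(axis=0))  # small non-zero offsets
```

Using the training statistics for the test set mirrors how the model would see genuinely new data, which is why the `StandardScaler` route is generally preferred for evaluation.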
In [93]:
'''
###########################################################################################
The disadvantages of Multi-layer Perceptron (MLP) include:

        1/ MLP with hidden layers have a non-convex loss function where there exists more than one local minimum. 
        #Therefore different random weight initializations can lead to different validation accuracy.
        2/ MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers,
        #and iterations.
        3/ MLP is sensitive to feature scaling.

###########################################################################################
Scaling Data - Train and Test sets both for MLP - Multi-layer Perceptron is sensitive to feature scaling, 
so it is highly recommended to scale your data. For example, scale each attribute on the input 
vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that you must apply
the same scaling to the test set for meaningful results. You can use StandardScaler for standardization.
###########################################################################################

'''

# Neural Network - Multi-layer Perceptron (MLP)
#

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

#clf = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
#clf = MLPClassifier(solver='adam', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
#clf.fit(X_train_sc, y_test_sc1)                         

MLPC = MLPClassifier(random_state=2)

MLPC.fit(X_train,y_train)
Out[93]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=2, shuffle=True,
       solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)
In [94]:
scores = cross_val_score(MLPC,X_test,y_test,cv=5,scoring='accuracy')



print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.84 (+/- 0.00)
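
An accuracy of 0.84 with no spread matches the share of the dominant class, which suggests the network is predicting class 0 for every sample. Note also that the MLP above was fitted on unscaled `X_train`, and that `cross_val_score` re-fits it on folds of `X_test`. A sketch of one alternative is below: scaling and the classifier are wrapped in a pipeline and cross-validated together. It assumes synthetic data, and the hidden layer size and iteration limit are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption).
X_syn, y_syn = make_classification(n_samples=300, n_features=10,
                                   weights=[0.84, 0.16], random_state=0)

# The scaler is re-fitted inside every CV fold, so the held-out part of a
# fold never influences the scaling statistics.
mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=2),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mlp_scores = cross_val_score(mlp_pipe, X_syn, y_syn, cv=cv, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f)" % (mlp_scores.mean(), mlp_scores.std() * 2))
```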
In [95]:
# Source --  http://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes
# Gaussian Naive bayes -GaussianNB as 1st Classifier without any Feature Scaling 

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_train)
print("Number of mislabeled points out of a total %d points : %d" % (X_train.shape[0],(y_train != y_pred).sum()))

# 
'''
RAW DATA -- Number of mislabeled points out of a total 2058 points : 399

STD SCALER --- Number of mislabeled points out of a total 2058 points : 412

SCALER --- Number of mislabeled points out of a total 2058 points : 412

'''
Number of mislabeled points out of a total 2058 points : 386
In [39]:
# 1st RUN - Naive Bayes - with X_train_scaled and y_train 
# 2nd RUN - Naive Bayes - with X_train_sc and y_train 
#

# Instantiate - Multinomial Naive Bayes
#
'''
# When we use X_train_scaled or X_train_sc
# We seem to be getting a Sparse Matrix or negative Value Error at location --- 
/home/dhankar/anaconda2/lib/python2.7/site-packages/sklearn/naive_bayes.pyc in _count(self, X, Y)
    688         """Count and smooth feature occurrences."""
    689         if np.any((X.data if issparse(X) else X) < 0):
--> 690             raise ValueError("Input X must be non-negative")
    691         self.feature_count_ += safe_sparse_dot(Y.T, X)
    692         self.class_count_ += Y.sum(axis=0)

ValueError: Input X must be non-negative


'''
# Based on these questions -- http://stats.stackexchange.com/questions/169400/naive-bayes-questions-continus-data-negative-data-and-multinomialnb-in-scikit
# http://stackoverflow.com/questions/34725726/is-it-possible-apply-pca-on-any-text-classification
#
# With Scaler and STD Scaler data we use only GaussianNB, not MultinomialNB.

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Fit on the raw, unscaled X_train (timed with the IPython "magic command" %time).
# This only prints the fitted estimator; the model is not used further.
%time nb.fit(X_train, y_train)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 4.47 ms
Out[39]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
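MultinomialNB rejects negative input, which is why the standardized data fails above. A rough sketch of one workaround, assuming toy non-negative features: rescale with MinMaxScaler, which keeps values in [0, 1]. Whether a multinomial model suits these features at all is a separate question, and this post stays with GaussianNB. X_demo, y_demo and nb_pipe are illustrative names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.randint(0, 50, size=(200, 5)).astype(float)  # toy non-negative features
y_demo = (X_demo[:, 0] > 25).astype(int)                  # toy labels

# Standardized data contains negative entries, which MultinomialNB refuses
X_std = StandardScaler().fit_transform(X_demo)
try:
    MultinomialNB().fit(X_std, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print("StandardScaler input rejected: %s" % err)

# Min-max scaled data stays in [0, 1], so the fit goes through
nb_pipe = make_pipeline(MinMaxScaler(), MultinomialNB())
nb_pipe.fit(X_demo, y_demo)
print(nb_pipe.predict(X_demo[:5]))
```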
In [96]:
# 1st RUN - KNN - with X_train_scaled and y_train 
# 2nd RUN - KNN - with X_train_sc and y_train 
#

from sklearn.neighbors import KNeighborsClassifier

# Instantiate kNN model with 1 Neighbour 
knn = KNeighborsClassifier(n_neighbors=1)

# Fit kNN model with Train data (occurs in-place)
#knn.fit(X_train_scaled, y_train)
knn.fit(X_train_sc, y_train)
Out[96]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
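n_neighbors=1 is only one possible setting. Below is a minimal sketch of comparing a few values of k with cross-validation, assuming synthetic data and a hypothetical list of candidate values. The scaler is placed inside a pipeline so it is refitted in each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=123)

cv_means = {}
for k in (1, 3, 5, 11):  # candidate neighbour counts (illustrative)
    knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(knn_pipe, X_demo, y_demo, cv=5,
                                  scoring="accuracy").mean()
    print("k=%2d  mean CV accuracy: %0.3f" % (k, cv_means[k]))
```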
In [97]:
# 1st RUN - predict kNN class with X_test_scaled
# 2nd RUN - predict kNN class with X_test_sc

#y_pred_class_kNN = knn.predict(X_test_scaled)
y_pred_class_kNN = knn.predict(X_test_sc)

y_pred_class_kNN.shape
Out[97]:
(882,)
In [99]:
# calculate accuracy MODEL EVAL 
# calculate accuracy of class predictions

'''
The AUC and ROC ==
#########################################################################################################
#####################################
Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality 
using cross-validation.

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. 
This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, 
and a true positive rate of one. This is not very realistic, but it does mean that a larger area under 
the curve (AUC) is usually better.

The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive rate 
while minimizing the false positive rate.

#########################################################################################################

__________________________________________
0.950113378685 = 95.01%
_____Model Evaluation with AUC Area Under the Curve __________________
0.890597639893 = 89.05%
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       740
          1       0.88      0.80      0.84       142

avg / total       0.95      0.95      0.95       882

#########################################################################################################

##1st RUN_kNN_1 Neighbour - with  StandardScaler data .######################_________________________________________
0.945578231293 = 94.55%
_____Model Evaluation with AUC Area Under the Curve __________________
0.878800355589 = 87.88%
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       737
          1       0.88      0.78      0.82       145

avg / total       0.94      0.95      0.94       882


##1st RUN_kNN_1 Neighbour - with SCALER data .######################_________________________________________
Accuracy Score :- 
0.943310657596 = 94.33%
_____Model Evaluation with AUC Area Under the Curve __________________
0.871903803865 = 87.19%
             precision    recall  f1-score   support

          0       0.95      0.98      0.97       737
          1       0.87      0.77      0.82       145

avg / total       0.94      0.94      0.94       882

##1st RUN_kNN_1 Neighbour - .######################
kNN of Non PCA Data Set with Seed 123 -- Feature AGE Included 

Accuracy Score :- 
0.922902494331 = 92.29% 
_____Model Evaluation with AUC Area Under the Curve __________________
0.865231834558 = 86.52%
             precision    recall  f1-score   support

          0       0.96      0.95      0.95       737
          1       0.76      0.78      0.77       145

avg / total       0.92      0.92      0.92       882

##########################
# PCA data set --- Exactly same with multiple runs - seed or no seed -- 

Accuracy Score :- 
0.834467120181
precision    recall  f1-score   support

          0       0.84      1.00      0.91       737
          1       0.00      0.00      0.00       145       ##### Notice All ZERO's === NO 1's being Predicted ?? 

avg / total       0.70      0.83      0.76       882

##2nd RUN ...########################
Non PCA Data Set with Seed 123 -- Feature AGE Not Included 

#1 _____Model Evaluation Accuracy Score
0.878684807256 == 87.87% , Accuracy Score - which is only OK, not good, as a no-model baseline that always predicts 0 already scores - 
1 - dfh["Attrition"].mean() == 83.88% (as calculated earlier)

#2_____Model Evaluation with AUC Area Under the Curve __________________
0.672582229916 = 67.25%

0.878684807256
             precision    recall  f1-score   support

          0       0.89      0.98      0.93       737
          1       0.78      0.37      0.50       145      ##### Notice All Non ZERO's === 

avg / total       0.87      0.88      0.86       882

##3rd RUN ...######################
Non PCA Data Set with Seed 123 -- Feature AGE Included 


#1 _____Model Evaluation Accuracy Score
0.863945578231 = 86.39%

#2_____Model Evaluation with AUC Area Under the Curve __________________
0.613905394657 = 61.39%
             precision    recall  f1-score   support

          0       0.87      0.99      0.92       737
          1       0.78      0.24      0.37       145

avg / total       0.85      0.86      0.83       882

'''

from sklearn import metrics

print type(y_pred_class_kNN)
print len(y_pred_class_kNN)
print "__________________________________________"
print len(y_test)
print('kNN Model predicted classes: {}'.format(y_pred_class_kNN))
print('Actual data - Real classes: {}'.format(y_test))
print "__________________________________________"
print metrics.accuracy_score(y_test, y_pred_class_kNN)
print "_____Model Evaluation with AUC Area Under the Curve __________________"
print metrics.roc_auc_score(y_test, y_pred_class_kNN)
#
print(metrics.classification_report(y_test, y_pred_class_kNN))
<type 'numpy.ndarray'>
882
__________________________________________
882
kNN Model predicted classes: [0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1
 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1
 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1
 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1
 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Actual data - Real classes: [0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1
 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1
 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1
 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0]
__________________________________________
0.950113378685
_____Model Evaluation with AUC Area Under the Curve __________________
0.890597639893
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       740
          1       0.88      0.80      0.84       142

avg / total       0.95      0.95      0.95       882
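The AUC above is computed from hard 0/1 class predictions, which gives the ROC curve a single interior point. A small sketch, assuming synthetic data, of how the score can differ when predicted probabilities are passed instead. With n_neighbors=1 the probabilities are themselves 0 or 1, so the two numbers coincide. With a larger k (15 here, an arbitrary illustrative value) they generally do not.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.84, 0.16], random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)

results = {}
for k in (1, 15):
    knn_demo = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # AUC from hard labels versus AUC from the class-1 probability
    auc_from_labels = metrics.roc_auc_score(y_te, knn_demo.predict(X_te))
    auc_from_proba = metrics.roc_auc_score(y_te, knn_demo.predict_proba(X_te)[:, 1])
    results[k] = (auc_from_labels, auc_from_proba)
    print("k=%2d  AUC from labels: %0.3f  AUC from probabilities: %0.3f"
          % (k, auc_from_labels, auc_from_proba))
```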

In [43]:
# Logistic Regression
#
# import and instantiate a Logistic Regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
In [44]:
#Logistic Regression
#
# 1st RUN - RAW DATA --- Train model using X_train - the MODEL FIT step
# 2nd RUN - SCALED DATA --- Train model using X_train_scaled - the MODEL FIT step
#
%time logreg.fit(X_train_scaled, y_train)
print(logreg.intercept_)
print "____________________________"
print(logreg.coef_)

'''


#######################################################################################
# 1st RUN - RAW DATA --- Train model using X_train - the MODEL FIT step
CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 326 ms
[-2.36830026]
____________________________
[[-0.22926078  0.02842471 -0.12015534  0.53251037  0.27349992 -0.00501477
  -0.07026319 -0.43308996  0.14688605 -0.05153573 -0.29965872 -0.32116135
  -0.21602175 -0.44955459 -0.43029879 -0.14698227  0.08871293  0.38247998
   0.73392199 -0.2554244   0.18246334 -0.23867643  0.         -0.17085906
  -0.29944854 -0.10888722 -0.15189001  0.5286157  -0.47053308  0.51628589
  -0.46213338]]

#######################################################################################
  
1st Run === Seed=== 123 

CPU times: user 68 ms, sys: 4 ms, total: 72 ms
Wall time: 524 ms
[ 0.00057899]
____________________________
[[ -2.41465872e-02  -4.92080733e-03  -2.34642609e-04   1.68309915e-01
    2.70271877e-02  -9.01246489e-03  -2.26917398e-02  -3.27923361e-01
    4.00104704e-02  -1.85470147e-03  -2.56502435e-01  -5.68589051e-02
    8.84412682e-02  -3.24237113e-01  -3.16467654e-01  -1.25498378e-04
    8.15611102e-06   1.60560664e-01   4.71414431e-01  -3.66879867e-02
    5.26985015e-02  -1.40384273e-01   4.63188771e-02  -2.98077801e-01
   -2.60246042e-02  -1.15840781e-01  -1.15749399e-01   7.66046992e-02
   -1.17914023e-01   1.45985087e-01  -1.31963175e-01]]

'''
CPU times: user 12 ms, sys: 4 ms, total: 16 ms
Wall time: 84.2 ms
[-2.41473661]
____________________________
[[-0.29824261 -0.02786877 -0.11608357  0.57165683  0.30420618 -0.0155586
  -0.14053571 -0.32695513  0.11152219 -0.03875437 -0.37266742 -0.2628613
  -0.24471147 -0.46447519 -0.41347294 -0.37362517  0.00126928  0.3632543
   0.80808625 -0.08624258  0.09537121 -0.2493944  -0.18767589 -0.11238096
  -0.20185816 -0.23627429  0.43214554 -0.47620115  0.56820966 -0.46267434]]
In [45]:
# make class predictions for X_test # the MODEL PRED 
# make class predictions for X_test_scaled # the MODEL PRED == y_pred_class_scaled

y_pred_class_scaled = logreg.predict(X_test_scaled)
print type(y_pred_class_scaled)


'''
# TBD --- Check 

print y_pred_class.shape
print y_pred_class
print "__________________________________"
print y_test

'''
<type 'numpy.ndarray'>
In [13]:
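# calculate predicted probabilities of class 1 for the test set (logistic regression is usually well calibrated)
# logreg was fit on X_train_scaled, so the scaled test set is passed here
y_pred_prob = logreg.predict_proba(X_test_scaled)[:, 1]
#y_pred_prob # output not printed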
In [100]:
# calculate accuracy MODEL EVAL 
# calculate accuracy of class predictions
from sklearn import metrics


print len(y_pred_class_scaled)
print "__________________________________________"
print len(y_test)
print('Logistic Reg Model predicted classes: {}'.format(y_pred_class_scaled))
print('Actual data - Real classes: {}'.format(y_test))
print "__________________________________________"
print metrics.accuracy_score(y_test, y_pred_class_scaled)
print "_____Model Evaluation with AUC Area Under the Curve __________________"
print metrics.roc_auc_score(y_test, y_pred_class_scaled)
#
print(metrics.classification_report(y_test, y_pred_class_scaled))

'''
####################################################################################

# Scaler Data -- 
__________________________________________
0.891156462585 = 89.11% 
_____Model Evaluation with AUC Area Under the Curve __________________
0.704653597259
             precision    recall  f1-score   support

          0       0.90      0.98      0.94       740
          1       0.80      0.43      0.56       142

avg / total       0.88      0.89      0.88       882



# Scaler Data -- 
0.877551020408 = 87.75% 
_____Model Evaluation with AUC Area Under the Curve __________________
0.669133954054 = 66.91% 
             precision    recall  f1-score   support

          0       0.89      0.98      0.93       737
          1       0.78      0.36      0.49       145

avg / total       0.87      0.88      0.86       882


####################################################################################
# PCA data set --- Exactly same with multiple runs - seed or no seed -- 

0.834467120181
precision    recall  f1-score   support

          0       0.84      1.00      0.91       737
          1       0.00      0.00      0.00       145       ##### Notice All ZERO's === NO 1's being Predicted ?? 

avg / total       0.70      0.83      0.76       882

##2nd RUN ...##########################################################################
Non PCA Data Set with Seed 123 -- Feature AGE Not Included 

#1 _____Model Evaluation Accuracy Score
0.878684807256 == 87.87% , Accuracy Score - which is only OK, not good, as a no-model baseline that always predicts 0 already scores - 
1 - dfh["Attrition"].mean() == 83.88% (as calculated earlier)

#2_____Model Evaluation with AUC Area Under the Curve __________________
0.672582229916 = 67.25%

0.878684807256
             precision    recall  f1-score   support

          0       0.89      0.98      0.93       737
          1       0.78      0.37      0.50       145      ##### Notice All Non ZERO's === 

avg / total       0.87      0.88      0.86       882

##3rd RUN ...##############################################################################
Non PCA Data Set with Seed 123 -- Feature AGE Included 


#1 _____Model Evaluation Accuracy Score
0.863945578231 = 86.39%

#2_____Model Evaluation with AUC Area Under the Curve __________________
0.613905394657 = 61.39%
             precision    recall  f1-score   support

          0       0.87      0.99      0.92       737
          1       0.78      0.24      0.37       145

avg / total       0.85      0.86      0.83       882

'''
882
__________________________________________
882
Logistic Reg Model predicted classes: [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
Actual data - Real classes: [0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1
 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1
 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1
 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0]
__________________________________________
0.891156462585
_____Model Evaluation with AUC Area Under the Curve __________________
0.704653597259
             precision    recall  f1-score   support

          0       0.90      0.98      0.94       740
          1       0.80      0.43      0.56       142

avg / total       0.88      0.89      0.88       882

In [47]:
# print the confusion matrix
'''

         PRED      PRED
    _____0___________1________
ACTUAL | TN   |     FP
  0    |      |
    ________________________
ACTUAL | FN   |     TP
  1    |      |
    ________________________

2nd RUN ---
array([[722== TN,  15 == FP],
       [ 92== FN,  53 == TP]])
       
3rd RUN ---
[[727  10]
 [110  35]]
       
'''

print metrics.confusion_matrix(y_test, y_pred_class_scaled)

# Total 882 
[[725  15]
 [ 81  61]]
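The classification report and the confusion matrix can be tied together with a little arithmetic. Below is a minimal sketch using only the four counts printed above. When roc_auc_score is given hard 0/1 predictions, as in this post, the ROC curve has a single interior point and its area reduces to (TPR + TNR) / 2. The variable names are illustrative.

```python
# Counts from the confusion matrix printed above (rows = actual, columns = predicted)
tn, fp = 725, 15
fn, tp = 81, 61

total = tn + fp + fn + tp

accuracy = (tn + tp) / float(total)
precision_1 = tp / float(tp + fp)    # of the predicted 1's, how many are real 1's
recall_1 = tp / float(tp + fn)       # true positive rate (TPR)
specificity = tn / float(tn + fp)    # true negative rate (TNR)

# AUC of a ROC curve built from hard labels only: one point at (1 - TNR, TPR)
auc_hard = (recall_1 + specificity) / 2.0

print("total       = %d" % total)
print("accuracy    = %0.4f" % accuracy)
print("precision_1 = %0.4f" % precision_1)
print("recall_1    = %0.4f" % recall_1)
print("specificity = %0.4f" % specificity)
print("auc_hard    = %0.4f" % auc_hard)
```

These values agree with the 0.891156462585 accuracy and 0.704653597259 AUC printed in the evaluation cell, and with the 0.80 precision and 0.43 recall reported for class 1.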
In [41]:
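# print the correctly classified rows (true negatives + true positives)
# Note: y_test == y_pred_class selects every correct prediction, not only the true positives;
# the variable name T_Positives is kept from earlier runs

X_test[y_test == y_pred_class]

T_Positives = X_test[y_test == y_pred_class]

print type(T_Positives)

print T_Positives.shape
# 2nd RUN: (775, 30) == 722 + 53 == main diagonal of the confusion matrix (TN + TP)

print T_Positives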
<type 'numpy.ndarray'>
(762, 31)
[[  35    1  882 ...,    9    0    8]
 [  31    2  667 ...,    0    0    0]
 [  30    1  317 ...,    4    0    2]
 ..., 
 [  24    1  506 ...,    2    1    2]
 [  31    1 1079 ...,    2    1    4]
 [  36    3 1229 ...,    7    0    7]]
In [42]:
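# print the misclassified rows (false positives + false negatives)
# Note: y_test != y_pred_class selects every wrong prediction, not the true negatives;
# the variable name T_Negatives is kept from earlier runs

X_test[y_test != y_pred_class]

T_Negatives = X_test[y_test != y_pred_class]

print type(T_Negatives)

print T_Negatives.shape

print T_Negatives

# Earlier run: 146 = 145 + 1 --- the off-diagonal of the Confusion Matrix (FN + FP)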


'''

<type 'numpy.ndarray'>
(146, 12)
[[  31  667    1 ...,    0    0    0]
 [  29  992    1 ...,    2    1    5]
 [  26  342    2 ...,    2    1    2]
 ..., 
 [  34  988   23 ...,    2    0    2]
 [  26 1330   21 ...,    1    0    0]
 [  25  383    9 ...,    2    2    2]]



'''
<type 'numpy.ndarray'>
(120, 31)
[[  29    1  992 ...,    2    1    5]
 [  41    1 1085 ...,    7    1    0]
 [  29    1  408 ...,    8    3   10]
 ..., 
 [  21    1  501 ...,    2    1    2]
 [  34    2  988 ...,    2    0    2]
 [  26    1 1330 ...,    1    0    0]]
In [37]:
# print the false positives

X_test[y_test < y_pred_class]

False_Positives = X_test[y_test < y_pred_class]
print type(False_Positives)

print False_Positives.shape

print False_Positives
<type 'numpy.ndarray'>
(1, 12)
[[   32   267    29    49  2837 15919    13     6     6     2     4     1]]
In [38]:
# print the false negatives 

False_Negatives = X_test[y_test > y_pred_class]
print type(False_Negatives)

print False_Negatives.shape

print False_Negatives
<type 'numpy.ndarray'>
(145, 12)
[[  31  667    1 ...,    0    0    0]
 [  29  992    1 ...,    2    1    5]
 [  26  342    2 ...,    2    1    2]
 ..., 
 [  34  988   23 ...,    2    0    2]
 [  26 1330   21 ...,    1    0    0]
 [  25  383    9 ...,    2    2    2]]
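The four cells of the confusion matrix can each be pulled out with a pair of boolean conditions. A small sketch on toy arrays (the labels, predictions and feature rows are made up for illustration):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # toy actual classes (assumption)
y_hat = np.array([0, 1, 1, 0, 0, 1, 0, 0])    # toy predicted classes (assumption)
X_rows = np.arange(len(y_true) * 2).reshape(len(y_true), 2)  # toy feature rows

# Each mask combines the actual class with the predicted class
true_pos = X_rows[(y_true == 1) & (y_hat == 1)]
true_neg = X_rows[(y_true == 0) & (y_hat == 0)]
false_pos = X_rows[(y_true == 0) & (y_hat == 1)]
false_neg = X_rows[(y_true == 1) & (y_hat == 0)]

print("TP rows: %d, TN rows: %d, FP rows: %d, FN rows: %d"
      % (len(true_pos), len(true_neg), len(false_pos), len(false_neg)))
```

By contrast, y_true == y_hat alone returns TP and TN together, and y_true != y_hat returns FP and FN together.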
In [14]:
type(y_pred_class)
Out[14]:
numpy.ndarray
In [15]:
y_pred_class.shape
Out[15]:
(882,)
In [43]:
# Correlation from DF Data 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

corr_df1 = df1.corr(method='pearson')

#print("--------------- CORRELATIONS ---------------")

#print(corr_dfpca.head(len(dfpca))) # Not required as we are plotting the correlation 

# We can look at Column 1 of the printout below to see which features have an
# absolute correlation greater than 0.1 - negative or positive, both considered.
In [45]:
print("--------------- CREATE A HEATMAP ---------------")
# Create a mask to display only the lower triangle of the matrix (since it's mirrored around its 
# top-left to bottom-right diagonal).
mask = np.zeros_like(corr_df1, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Create the heatmap using the seaborn library. 
# A list of colormaps (parameter 'cmap') is available here: http://matplotlib.org/examples/color/colormaps_reference.html
seaborn.heatmap(corr_df1, cmap='RdYlGn_r', vmax=1.0, vmin=-1.0 , mask = mask, linewidths=2.5)
 
# Before showing the plot, reorient the labels for each column and row to make them easier to read.
plt.yticks(rotation=0) 
plt.xticks(rotation=90) 
plt.show()
--------------- CREATE A HEATMAP ---------------
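The correlation cell mentions picking out features whose correlation with the first column exceeds 0.1 in absolute value. A rough sketch of doing that in code rather than by eye, assuming a toy DataFrame with an "Attrition" target column. FeatureA and FeatureB are hypothetical column names.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 500
demo = pd.DataFrame({"Attrition": rng.randint(0, 2, size=n)})
demo["FeatureA"] = demo["Attrition"] * 2.0 + rng.normal(size=n)  # built to correlate with the target
demo["FeatureB"] = rng.normal(size=n)                            # noise, expected to sit near zero

corr_demo = demo.corr(method="pearson")

# Correlation of every feature with the target, dropping the target's self-correlation
target_corr = corr_demo["Attrition"].drop("Attrition")

# Keep features whose absolute correlation exceeds 0.1
strong = target_corr[target_corr.abs() > 0.1].sort_values(ascending=False)
print(strong)
```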
In [ ]:
# Watch this space for more 
In [8]:
#Sandbox Code - Not required anymore 

#
# We are now using - from sklearn.model_selection import StratifiedShuffleSplit

#
#Add version check for recent scikit-learn 0.18 checks
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
import random
random.seed(123) # Note: this seeds only Python's random module; the split below is made reproducible by random_state=0

#Split data - 70% training, 30% test set:

if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
#
print X_train.shape
print X_test.shape
print y_train.shape # Pred Variable Only - Attrition--  Train Set 
print y_test.shape  # Pred Variable Only - Attrition--  Test Set 
(2058, 31)
(882, 31)
(2058,)
(882,)
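The sandbox cell above notes that StratifiedShuffleSplit is now used instead. A minimal sketch of that splitter on toy data, with the same 2058 + 882 = 2940 row count. The labels, n_splits=5 and random_state=123 are assumptions for illustration. Each split keeps the class proportion nearly identical in the train and test parts.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(123)
X_demo = rng.normal(size=(2940, 4))            # toy features, 2940 rows
y_demo = (rng.rand(2940) < 0.16).astype(int)   # toy imbalanced labels (assumption)

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=123)

splits = []
for train_idx, test_idx in sss.split(X_demo, y_demo):
    X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
    y_tr, y_te = y_demo[train_idx], y_demo[test_idx]
    splits.append((len(train_idx), len(test_idx), y_tr.mean(), y_te.mean()))
    print("train rows: %d  test rows: %d  class-1 share train: %0.3f  test: %0.3f"
          % splits[-1])
```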
In [28]:
##### Sandbox from -- http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
# OK 

X_train1 = np.array([[1,2,3],[4,5,6],[12,13,14]])

print X_train1.shape 

from sklearn import preprocessing
from sklearn.preprocessing import scale

X_trn_scaled = preprocessing.scale(X_train1)

print type(X_trn_scaled)

print "_________________________________"
print X_trn_scaled
print "_________________________________"
print X_trn_scaled.mean(axis=0)
print "_________________________________"
print X_trn_scaled.std(axis=0)
(3, 3)
<type 'numpy.ndarray'>
_________________________________
[[-1.00514142 -1.00514142 -1.00514142]
 [-0.35897908 -0.35897908 -0.35897908]
 [ 1.3641205   1.3641205   1.3641205 ]]
_________________________________
[ 0.  0.  0.]
_________________________________
[ 1.  1.  1.]
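The scaled values above can be reproduced by hand: subtract each column's mean and divide by its population standard deviation. A small sketch of that arithmetic, plus StandardScaler applying the stored training statistics to a new row. X_new is a hypothetical unseen row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train1 = np.array([[1, 2, 3], [4, 5, 6], [12, 13, 14]], dtype=float)

# Manual z-score per column: (x - mean) / population std (ddof=0)
manual = (X_train1 - X_train1.mean(axis=0)) / X_train1.std(axis=0)
print(manual)

# StandardScaler keeps the training mean and std, so the same transform can be reused
scaler = StandardScaler().fit(X_train1)
X_new = np.array([[4.0, 5.0, 6.0]])   # hypothetical unseen row
print(scaler.transform(X_new))
```

This is the practical difference between Option -1 and Option -2 in the list at the top of the post: the scale function transforms one array in a single step, while the StandardScaler class can be fitted on the training set and then applied unchanged to the test set.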
