Kaggle_Titanic [Multiple Imputation of Missing_Values]
Main Source - Megan Risdal's R script on Kaggle - to be extended with ideas from many other Kaggle scripts - WIP
8 January 2017
# Load packages
suppressMessages(library('ggplot2')) # visualization
suppressMessages(library('ggthemes')) # visualization
suppressMessages(library('scales')) # visualization
suppressMessages(library('dplyr')) # data manipulation
suppressMessages(library('mice')) # imputation
suppressMessages(library('randomForest')) # classification algorithm
train <- read.csv('train.csv', stringsAsFactors = FALSE)
test <- read.csv('test.csv', stringsAsFactors = FALSE)
full <- bind_rows(train, test) # bind training & test data
# check data
str(full)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
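Why don't we see any NA's or missing values for Cabin in the summary above?
read.csv() turns blank fields into NA only for numeric columns such as Age. In a character column such as Cabin a blank field is read as the empty string "", which summary() does not count as missing.
The Python code does report these values as missing, and they are visibly empty when the data frame is viewed in R.
In her Kaggle analysis, Megan moves straight to section 2.3, "Treat a few more variables …", and checks Cabin for missing values. How does she know about these missing values from the code or the summary? In Python we get it from the code:
Cabin Python code - Code Cell [4] - https://github.com/RohitDhankar/KAGGLE_Titanic_initial/blob/master/Titanic_2_OwnCode.ipynb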
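A minimal sketch of the blank-versus-NA behaviour, using a made-up inline CSV in place of train.csv (the values and the object names csv_text, raw and fixed are assumptions for illustration only). It shows that a blank Cabin is read as "" by default, and as NA once na.strings includes the empty string.

```r
# Hypothetical mini-CSV standing in for train.csv (values are made up).
csv_text <- "PassengerId,Age,Cabin\n1,22,\n2,38,C85\n3,,\n4,35,C123"

# Default read: the blank Age becomes NA, but the blank Cabin stays "".
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
n_blank_cabin  <- sum(raw$Cabin == "")
n_na_cabin_raw <- sum(is.na(raw$Cabin))

# Treat empty strings as missing at read time.
fixed <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("", "NA"))
n_na_cabin_fixed <- sum(is.na(fixed$Cabin))
```

Reading the real files with the same na.strings setting should make the Cabin blanks visible to summary() and to the missing-value tools used later, though that has not been run here.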
str(full)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
#
# Grab title from passenger names
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
# Show title counts by sex
table(full$Sex, full$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
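To see what the gsub() pattern does, here is a small sketch on three names copied from the str(full) output above. The vector name passenger_names is an assumption for illustration.

```r
# Three names copied from the str(full) output above.
passenger_names <- c("Braund, Mr. Owen Harris",
                     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                     "Heikkinen, Miss. Laina")

# Drop everything up to ", " and everything from the first "." onwards,
# which leaves only the title between the surname and the given names.
titles <- gsub('(.*, )|(\\..*)', '', passenger_names)
```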
Dhankar - No merging of titles has been done yet. The rare-title step from the source script is held back for now and shown here for reference only (not run).
# Titles with very low cell counts to be combined to a "rare" level
# rare_title <- c('Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don',
#                 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
# Also reassign mlle, ms, and mme accordingly
# full$Title[full$Title == 'Mlle'] <- 'Miss'
# full$Title[full$Title == 'Ms'] <- 'Miss'
# full$Title[full$Title == 'Mme'] <- 'Mrs'
# full$Title[full$Title %in% rare_title] <- 'Rare Title'
# Show title counts by sex again
table(full$Sex, full$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
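The title merge that is being held back can be tried safely on a toy vector first. This is only a sketch: the titles vector below is made up, and nothing here touches full$Title.

```r
# Toy vector of titles (made up for illustration, not the Titanic data).
titles <- c("Mr", "Mlle", "Ms", "Mme", "Dr", "Miss", "the Countess")

rare_title <- c("Dona", "Lady", "the Countess", "Capt", "Col", "Don",
                "Dr", "Major", "Rev", "Sir", "Jonkheer")

# Fold the French and abbreviated forms into Miss / Mrs,
# then pool the low-count titles into one level.
titles[titles == "Mlle"] <- "Miss"
titles[titles == "Ms"]   <- "Miss"
titles[titles == "Mme"]  <- "Mrs"
titles[titles %in% rare_title] <- "Rare Title"

title_counts <- table(titles)
```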
#
# Finally, grab surname from passenger name
full$Surname <- sapply(full$Name,
function(x) strsplit(x, split = '[,.]')[[1]][1])
#table(full$Surname, full$Sex)
# 875 Unique Surnames
nlevels(factor(full$Surname))
## [1] 875
#
# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1
# Create a family variable
full$Family <- paste(full$Surname, full$Fsize, sep='_')
# Use ggplot2 to visualize the relationship between family size & survival
ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks=c(1:11)) +
labs(x = 'Family Size') +
theme_few()
#
# Discretize family size
full$FsizeD[full$Fsize == 1] <- 'singleton'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'small'
full$FsizeD[full$Fsize > 4] <- 'large'
# Show family size by survival using a mosaic plot
mosaicplot(table(full$FsizeD, full$Survived), main='Family Size by Survival', shade=TRUE)
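As an alternative to the three separate assignments, the same bins can be sketched with cut(). This is a swapped-in technique, not what the source script does, and the fsize vector is made up.

```r
# Made-up family sizes covering each bin.
fsize <- c(1, 2, 4, 5, 11)

# Intervals are right-closed by default: (0,1], (1,4], (4,Inf].
fsize_d <- cut(fsize,
               breaks = c(0, 1, 4, Inf),
               labels = c("singleton", "small", "large"))
```

cut() returns a factor with the levels in the order given, which may be convenient later since the modelling step needs factors rather than character vectors.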
# Code Source -- R Script of Megan Risdal on Kaggle - https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic#
#
# This CABIN variable appears to have a lot of missing values
full$Cabin[1:28]
## [1] "" "C85" "" "C123" ""
## [6] "" "E46" "" "" ""
## [11] "G6" "C103" "" "" ""
## [16] "" "" "" "" ""
## [21] "" "D56" "" "A6" ""
## [26] "" "" "C23 C25 C27"
# The first character is the deck. For example:
strsplit(full$Cabin[2], NULL)[[1]]
## [1] "C" "8" "5"
# Create a Deck variable. Get passenger deck A - G (plus T); blank cabins become NA:
full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
full$Deck[1:10]
## [1] <NA> C <NA> C <NA> <NA> E <NA> <NA> <NA>
## Levels: A B C D E F G T
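A shorter way to get the same deck letter is substr(), sketched here on a few Cabin values seen in the output above. This is an alternative to the strsplit() call, not the source script's method; the names cabin, deck_chr and deck are illustrative.

```r
# A few Cabin values as they appear in the output above, blanks included.
cabin <- c("", "C85", "", "C123", "E46", "C23 C25 C27")

# First character is the deck; a blank cabin gives "" which we recode to NA.
deck_chr <- substr(cabin, 1, 1)
deck_chr[deck_chr == ""] <- NA
deck <- factor(deck_chr)
```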
# The first 10 values of full$Deck, followed by its levels
#full[!complete.cases(full),]
# full[!complete.cases(full),] returns every observation in the Titanic data that has a missing value in any feature
# Left commented out - too much output to print
### Package MICE - Multiple Imputation with MCA
### Source URL - http://juliejosse.com/wp-content/uploads/2016/06/user2016.pdf
### Source URL - https://arxiv.org/pdf/1606.05333v2.pdf
### Source - http://www.ats.ucla.edu/stat/r/faq/R_pmm_mi.htm
suppressMessages(library(mice))
suppressMessages(library(VIM))
## Warning: replacing previous import 'robustbase::sigma' by 'stats::sigma'
## when loading 'VIM'
suppressMessages(library(dplyr))
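# Checking the missing values in the data
#md.pattern(full) ### DHANKAR - OK Not Required
#options(warn=-1) ### DHANKAR - OK Not Required
# Note: md.pattern() converts the data with data.matrix(), so in this run the character columns
# (Name, Sex, Ticket, Cabin, Embarked) are coerced to NA wherever they are not numeric. That is the source
# of the warnings and of the 891 / 230 counts below; only the 177 for Age counts genuine NA's.
suppressMessages(md.pattern(train))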
## Warning in data.matrix(x): NAs introduced by coercion
## Warning in data.matrix(x): NAs introduced by coercion
## Warning in data.matrix(x): NAs introduced by coercion
## Warning in data.matrix(x): NAs introduced by coercion
## Warning in data.matrix(x): NAs introduced by coercion
## PassengerId Survived Pclass SibSp Parch Fare Age Ticket Name Sex Cabin
## 521 1 1 1 1 1 1 1 1 0 0 0
## 140 1 1 1 1 1 1 0 1 0 0 0
## 193 1 1 1 1 1 1 1 0 0 0 0
## 37 1 1 1 1 1 1 0 0 0 0 0
## 0 0 0 0 0 0 177 230 891 891 891
## Embarked
## 521 0 4
## 140 0 5
## 193 0 5
## 37 0 6
## 891 3971
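Because the pattern table above mixes genuine NA's with coercion artefacts, a plain per-column count can be a useful cross-check. The sketch below uses made-up rows shaped like a few Titanic columns; toy, is_chr and na_counts are illustrative names.

```r
# Made-up rows in the shape of the Titanic columns (not real passengers).
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA, in character columns only.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[x == ""] <- NA
  x
})

# Genuine missing values per column.
na_counts <- colSums(is.na(toy))
```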
## Number of observations per patterns for all pairs of variables
#p<-md.pairs(train) ### DHANKAR - OK Not Required
#p ### DHANKAR - OK Not Required
#suppressMessages(md.pattern(test))
#pp<-md.pairs(test)
#pp ### DHANKAR - OK Not Required
#Missing value plot - NAPlot - library(VIM)
NAPlot <- aggr(train, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(train), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Age 0.1986532
## PassengerId 0.0000000
## Survived 0.0000000
## Pclass 0.0000000
## Name 0.0000000
## Sex 0.0000000
## SibSp 0.0000000
## Parch 0.0000000
## Ticket 0.0000000
## Fare 0.0000000
## Cabin 0.0000000
## Embarked 0.0000000
## Margin plot of PassengerId (x) against Age (y). Age has 177 NA's; the figure 177 appears in red text at the bottom left.
## PassengerId is plotted on the x axis. The red dots along the x axis mark the PassengerId values for which Age is missing.
##
## There are no red dots along the y axis, because PassengerId is never missing.
#
marginplot(train[c(1,6)], col=c("blue", "red", "orange"))
Under the Missing Completely at Random (MCAR) assumption, the red and blue box plots should be nearly identical.
If they differ noticeably, as they appear to here, that is evidence that the values of Age in the Titanic data are not missing completely at random. It does not by itself tell us whether the mechanism is MAR or MNAR.
At the bottom left there is a square formed by the intersection of the plot margins. A numeric value would appear there in orange if there were any observations (rows) in which both Age and PassengerId were missing.
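One hedged way to probe the MCAR assumption numerically is to compare how often Age is missing across the levels of another variable. The data below is made up, and Pclass is chosen only for illustration; a large gap between the shares would count against MCAR.

```r
# Made-up data: Age is missing more often in class 3 than in class 1.
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 3, 3, 3, 3),
  Age    = c(38, 35, 54, NA, NA, NA, 2, NA)
)

# Share of missing Age within each class.
miss_by_class <- tapply(is.na(toy$Age), toy$Pclass, mean)
```

On the real data the same tapply() call could be run on train, ideally followed by a formal test, since small differences can arise by chance.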
### Margin plot - PassengerId and Age swapped - nothing much to see ...
#marginplot(train[c(6,1)], col=c("blue", "red", "orange"))
# The blanks in the Cabin feature do not show up here, because they are stored as empty strings "" rather than NA.
# marginplot(train[c(1,11)], col=c("blue", "red", "orange")) # Cabin blanks - still nothing, for the same reason
#
pbox(train, pos=1) # pos = 1 selects the first column, PassengerId
## Warning in createPlot(main, sub, xlab, ylab, labels, ca$at): not enough
## space to display frequencies
## library(mice): by default the mice() function creates m = 5 sets of imputed values for each missing observation
imp1 <-mice(train, m=5)
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
imp1
## Multiply imputed data set
## Call:
## mice(data = train, m = 5)
## Number of multiple imputations: 5
## Missing cells per column:
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
## Imputation methods:
## PassengerId Survived Pclass Name Sex Age
## "" "" "" "" "" "pmm"
## SibSp Parch Ticket Fare Cabin Embarked
## "" "" "" "" "" ""
## VisitSequence:
## Age
## 6
## PredictorMatrix:
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
## PassengerId 0 0 0 0 0 0 0 0 0
## Survived 0 0 0 0 0 0 0 0 0
## Pclass 0 0 0 0 0 0 0 0 0
## Name 0 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0 0
## Age 1 1 1 0 0 0 1 1 0
## SibSp 0 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0 0
## Ticket 0 0 0 0 0 0 0 0 0
## Fare 0 0 0 0 0 0 0 0 0
## Cabin 0 0 0 0 0 0 0 0 0
## Embarked 0 0 0 0 0 0 0 0 0
## Fare Cabin Embarked
## PassengerId 0 0 0
## Survived 0 0 0
## Pclass 0 0 0
## Name 0 0 0
## Sex 0 0 0
## Age 1 0 0
## SibSp 0 0 0
## Parch 0 0 0
## Ticket 0 0 0
## Fare 0 0 0
## Cabin 0 0 0
## Embarked 0 0 0
## Random generator seed value: NA
#
## The predictor matrix tells us which variables in the dataset were used to produce the predicted values for matching: a 1 in the Age row marks a predictor of Age.
## "Imputation methods:" lists the method mice used for each variable: "" for the complete variables and "pmm" for Age.
## For Titanic Age it is "pmm", Predictive Mean Matching.
## DHANKAR - Pending question - how to change these variables and use another set to predict the missing values
##
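On the pending question: mice() accepts a predictorMatrix argument, a square 0/1 matrix in which row = variable being imputed and column = predictor. The sketch below builds such a matrix by hand for a made-up numeric data frame and restricts the predictors of Age to Pclass and Fare. The data, the choice of predictors and the seed are assumptions for illustration; the mice() call is skipped if the package is not installed.

```r
# Made-up numeric data with gaps in Age (not the Titanic file).
toy <- data.frame(
  Pclass = c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 2, 1),
  SibSp  = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0),
  Fare   = c(7, 71, 8, 53, 8, 9, 52, 21, 11, 30, 13, 27),
  Age    = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14, NA, 58)
)

# Square 0/1 matrix: row = variable being imputed, column = predictor.
pred <- matrix(0, ncol(toy), ncol(toy),
               dimnames = list(names(toy), names(toy)))
pred["Age", c("Pclass", "Fare")] <- 1  # impute Age from Pclass and Fare only

# Run the imputation only when mice is available.
if (requireNamespace("mice", quietly = TRUE)) {
  imp_toy <- mice::mice(toy, m = 5, predictorMatrix = pred,
                        seed = 123, printFlag = FALSE)
}
```

Passing a seed also makes the run reproducible, unlike the run above, whose output reports a seed value of NA.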
imp1$imp$Age
## 1 2 3 4 5
## 6 50.0 40.50 36.00 19.00 41.00
## 18 14.0 53.00 11.00 36.00 36.00
## 20 24.0 28.00 47.00 20.00 27.00
## 27 18.0 41.00 21.00 44.00 18.00
## 29 40.0 4.00 47.00 15.00 27.00
## 30 28.0 21.00 17.00 30.00 18.00
## 32 39.0 33.00 48.00 0.92 35.00
## 33 48.0 1.00 26.00 27.00 27.00
## 37 24.0 41.00 14.50 27.00 27.00
## 43 18.0 10.00 42.00 40.00 18.00
## 46 35.0 40.50 42.00 32.00 41.00
## 47 31.0 40.00 29.00 0.67 14.50
## 48 40.0 4.00 18.00 13.00 26.00
## 49 29.0 24.00 32.00 29.00 9.00
## 56 36.0 26.00 24.00 51.00 18.00
## 65 51.0 61.00 51.00 40.00 39.00
## 66 12.0 1.00 24.00 20.00 12.00
## 77 35.0 19.00 32.00 38.00 41.00
## 78 20.0 10.00 17.00 37.00 6.00
## 83 26.0 1.00 14.50 18.00 18.00
## 88 17.0 19.00 21.00 26.00 41.00
## 96 35.0 19.00 17.00 32.00 49.00
## 102 17.0 40.50 32.50 20.00 41.00
## 108 54.0 32.00 47.00 20.00 13.00
## 110 24.0 20.00 39.00 20.00 12.00
## 122 28.0 10.00 32.50 23.00 41.00
## 127 14.0 19.00 32.00 42.00 41.00
## 129 12.0 1.00 24.00 14.00 9.00
## 141 32.0 21.00 21.00 44.00 28.00
## 155 35.0 10.00 21.00 19.00 34.50
## 159 28.0 21.00 29.00 38.00 25.00
## 160 5.0 16.00 5.00 17.00 16.00
## 167 22.0 26.00 28.00 24.00 30.00
## 169 51.0 45.50 65.00 47.00 33.00
## 177 2.0 8.00 1.00 14.00 4.00
## 181 16.0 11.00 11.00 3.00 16.00
## 182 22.0 19.00 36.00 26.00 24.00
## 186 65.0 45.50 22.00 54.00 33.00
## 187 21.0 3.00 39.00 21.00 14.00
## 197 28.0 44.00 40.00 39.00 25.00
## 199 40.0 45.00 14.50 22.00 27.00
## 202 11.0 3.00 17.00 3.00 17.00
## 215 18.0 48.00 14.00 14.00 41.00
## 224 28.0 44.00 21.00 20.00 20.00
## 230 9.0 8.00 1.00 14.00 2.00
## 236 35.0 2.00 36.00 22.00 36.00
## 241 31.0 40.00 29.00 45.00 22.00
## 242 43.0 1.00 29.00 30.00 19.00
## 251 28.0 45.00 32.50 14.00 21.00
## 257 30.0 49.00 24.00 41.00 35.00
## 261 28.0 19.00 36.00 14.00 20.00
## 265 35.0 19.00 32.50 14.00 23.50
## 271 54.0 62.00 22.00 47.00 39.00
## 275 26.0 26.00 14.50 22.00 20.00
## 278 35.0 56.00 19.00 28.00 24.00
## 285 24.0 64.00 51.00 33.00 38.00
## 296 51.0 47.00 22.00 38.00 40.00
## 299 51.0 49.00 24.00 51.00 49.00
## 301 40.0 0.42 14.50 27.00 27.00
## 302 10.0 0.75 8.00 9.00 0.75
## 304 25.0 36.00 25.00 27.00 42.00
## 305 20.0 21.00 55.00 20.00 22.00
## 307 23.0 48.00 30.00 58.00 28.00
## 325 3.0 17.00 5.00 3.00 16.00
## 331 10.0 38.00 3.00 4.00 0.75
## 335 23.0 36.50 22.00 35.00 31.00
## 336 20.0 2.00 35.00 14.00 17.00
## 348 24.0 4.00 24.00 20.00 4.00
## 352 54.0 47.00 65.00 62.00 62.00
## 355 17.0 45.00 40.00 28.00 28.00
## 359 26.0 26.00 41.00 21.00 23.00
## 360 24.0 26.00 14.50 29.00 25.00
## 365 14.5 29.00 14.00 17.00 18.00
## 368 54.0 40.00 14.50 29.00 5.00
## 369 40.0 32.00 14.50 15.00 22.00
## 376 35.0 17.00 54.00 38.00 54.00
## 385 35.0 45.00 35.00 17.00 20.00
## 389 20.0 21.00 55.00 18.00 14.00
## 410 2.0 35.00 35.00 24.00 38.00
## 411 17.0 42.00 40.00 50.00 28.00
## 412 35.0 42.00 19.00 35.00 20.00
## 414 56.0 46.00 22.00 24.00 23.00
## 416 17.0 21.00 40.00 28.00 17.00
## 421 35.0 45.00 35.00 17.00 28.00
## 426 20.0 21.00 34.00 50.00 24.00
## 429 35.0 42.00 55.00 24.00 28.00
## 432 24.0 14.00 4.00 25.00 1.00
## 445 40.0 30.00 31.00 29.00 27.00
## 452 31.0 22.00 14.00 20.00 24.00
## 455 20.0 32.00 35.00 50.00 24.00
## 458 27.0 17.00 54.00 48.00 54.00
## 460 28.0 45.00 55.00 18.00 24.00
## 465 14.0 32.00 40.00 18.00 7.00
## 467 36.0 22.00 22.00 24.00 19.00
## 469 21.0 42.00 19.00 7.00 24.00
## 471 14.0 32.00 19.00 50.00 50.00
## 476 51.0 38.00 54.00 24.00 31.00
## 482 56.0 46.00 36.00 24.00 54.00
## 486 2.0 31.00 1.00 9.00 0.75
## 491 41.0 22.00 0.67 30.00 54.00
## 496 17.0 24.00 35.00 0.83 7.00
## 498 14.0 28.00 40.00 18.00 24.00
## 503 35.0 32.00 34.00 7.00 50.00
## 508 51.0 18.00 24.00 51.00 19.00
## 512 20.0 32.00 19.00 24.00 1.00
## 518 50.0 28.00 35.00 44.00 23.00
## 523 35.0 28.00 55.00 50.00 50.00
## 525 22.0 28.00 55.00 33.00 50.00
## 528 27.0 29.00 22.00 51.00 54.00
## 532 14.0 32.00 40.00 18.00 50.00
## 534 32.0 29.00 54.00 39.00 16.00
## 539 17.0 28.00 34.00 0.83 23.00
## 548 14.0 42.00 18.00 28.00 42.00
## 553 35.0 18.00 40.00 7.00 1.00
## 558 24.0 62.00 22.00 46.00 22.00
## 561 20.0 8.00 35.00 7.00 1.00
## 564 14.0 8.00 55.00 18.00 7.00
## 565 14.0 1.00 55.00 50.00 23.00
## 569 22.0 8.00 34.00 24.00 7.00
## 574 48.0 29.00 26.00 26.00 26.00
## 579 14.5 17.00 29.00 24.00 24.00
## 585 14.0 8.00 19.00 24.00 1.00
## 590 14.0 7.00 55.00 33.00 23.00
## 594 41.0 8.00 17.00 19.00 19.00
## 597 36.0 35.00 25.00 41.00 34.00
## 599 24.0 1.00 55.00 50.00 23.00
## 602 33.0 33.00 40.00 50.00 8.00
## 603 22.0 47.00 51.00 46.00 28.00
## 612 28.0 24.00 35.00 50.00 8.00
## 613 24.0 19.00 9.00 15.00 1.00
## 614 29.0 7.00 19.00 7.00 28.00
## 630 20.0 24.00 55.00 24.00 23.00
## 634 37.0 40.00 22.00 38.00 60.00
## 640 18.0 18.00 0.67 14.50 40.00
## 644 26.0 29.00 31.00 24.00 28.00
## 649 59.0 33.00 34.00 7.00 32.00
## 651 38.0 33.00 35.00 50.00 18.00
## 654 48.0 39.00 41.00 0.42 18.00
## 657 28.0 1.00 55.00 1.00 28.00
## 668 33.0 33.00 40.00 7.00 42.00
## 670 64.0 29.00 63.00 27.00 35.00
## 675 23.0 16.00 19.00 21.00 56.00
## 681 17.0 50.00 55.00 7.00 45.00
## 693 15.0 16.00 26.00 24.00 37.00
## 698 24.0 18.00 33.00 40.00 32.00
## 710 15.0 20.00 9.00 4.00 3.00
## 712 51.0 33.00 22.00 62.00 71.00
## 719 14.0 33.00 35.00 42.00 21.00
## 728 40.0 48.00 31.00 26.00 18.00
## 733 16.0 2.00 22.00 33.00 35.00
## 739 36.0 24.00 34.00 7.00 45.00
## 740 19.0 50.00 55.00 23.00 45.00
## 741 48.0 18.00 27.00 28.00 44.00
## 761 21.0 18.00 34.00 21.00 44.00
## 767 37.0 39.00 65.00 54.00 46.00
## 769 31.0 40.00 29.00 22.00 15.00
## 774 25.0 17.00 34.00 1.00 44.00
## 777 45.5 28.00 35.00 18.00 21.00
## 779 45.5 35.00 35.00 18.00 21.00
## 784 27.0 39.00 5.00 48.00 39.00
## 791 24.0 28.00 53.00 1.00 42.00
## 793 16.0 3.00 17.00 3.00 16.00
## 794 24.0 39.00 65.00 61.00 37.00
## 816 24.0 38.00 22.00 39.00 31.00
## 826 27.0 20.00 34.00 8.00 45.00
## 827 32.0 50.00 55.00 36.00 19.00
## 829 54.0 20.00 31.00 40.00 4.00
## 833 42.0 20.00 34.00 8.00 21.00
## 838 22.0 17.00 19.00 32.00 19.00
## 840 19.0 50.00 25.00 44.00 30.00
## 847 11.0 5.00 5.00 16.00 16.00
## 850 48.0 21.00 38.00 26.00 31.00
## 860 22.0 35.00 53.00 23.00 21.00
## 864 16.0 5.00 3.00 3.00 17.00
## 869 22.0 20.00 55.00 2.00 44.00
## 879 22.0 14.00 55.00 28.00 21.00
## 889 15.0 26.00 5.00 48.00 26.00
#
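The idea behind predictive mean matching can be sketched in base R. This is a simplified illustration on made-up data, not the algorithm as implemented in mice (which, among other things, also varies the regression coefficients between imputations). The donor count k = 3 and all object names are assumptions.

```r
# Simplified predictive mean matching on made-up data.
set.seed(1)
toy <- data.frame(
  Fare = c(7, 71, 8, 53, 8, 9, 52, 21, 11, 30),
  Age  = c(22, 38, 26, 35, 35, NA, 54, 2, 27, NA)
)
obs <- !is.na(toy$Age)

# 1. Regress Age on Fare using the observed rows only,
#    then predict Age for every row.
fit <- lm(Age ~ Fare, data = toy[obs, ])
pred_all <- predict(fit, newdata = toy)

# 2. For each missing row, find the k observed rows whose predicted Age
#    is closest, and copy the real Age of one randomly chosen donor.
k <- 3
imputed <- toy$Age
for (i in which(!obs)) {
  dist <- abs(pred_all[obs] - pred_all[i])
  donors <- order(dist)[seq_len(k)]
  imputed[i] <- toy$Age[obs][donors[sample.int(k, 1)]]
}
```

Because each imputed value is borrowed from a donor, it is always an age that actually occurs in the data. That would be consistent with the table above containing values such as 0.42 and 14.50.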
imp1$imp
## $PassengerId
## NULL
##
## $Survived
## NULL
##
## $Pclass
## NULL
##
## $Name
## NULL
##
## $Sex
## NULL
##
## $Age
## 1 2 3 4 5
## 6 50.0 40.50 36.00 19.00 41.00
## 18 14.0 53.00 11.00 36.00 36.00
## ... (remaining rows are identical to the imp1$imp$Age table printed above)
##
## $SibSp
## NULL
##
## $Parch
## NULL
##
## $Ticket
## NULL
##
## $Fare
## NULL
##
## $Cabin
## NULL
##
## $Embarked
## NULL
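# What type is imp1$imp? A list with one element per variable.
typeof(imp1$imp)
## [1] "list"
# What type is imp1$imp$Age? typeof() also reports "list" here, because a data frame is stored internally as a list of columns.
typeof(imp1$imp$Age)
## [1] "list"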
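The "list" result can look surprising for something that prints as a table. A minimal sketch on a made-up data frame (`toy_imp` is a hypothetical stand-in, not an object from this notebook) shows that typeof() reports the internal storage, while class() and is.data.frame() show that it is a data frame:

```r
# Toy stand-in for imp1$imp$Age: row names are row numbers, columns are imputations
toy_imp <- data.frame("1" = c(50, 14), "2" = c(40.5, 53),
                      row.names = c("6", "18"), check.names = FALSE)

typeof(toy_imp)         # internal storage type: "list"
class(toy_imp)          # "data.frame"
is.data.frame(toy_imp)  # TRUE
```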
#
# DHANKAR -- Rough description below, needs more work:
# The table printed above holds the values imputed for Age (method "pmm").
# Its row names (6, 18, 20, ...) are the row numbers of the observations in train whose Age is missing.
# Each of these rows has its blank Age filled in five times.
# Its five columns (1 to 5) are the five sets of imputed values, one column per imputation.
#
# TO RE-CHECK: the mice package manual (page 13) shows complete(imp) and complete(imp, 2)
# Stack the original data and the five completed data sets in long format
imp_tot2 <- complete(imp1, 'long', include=TRUE)
#imp_tot2 <- complete(imp1) # TO CHECK: why does this not work here? The mice manual (page 13) shows this form
typeof(imp_tot2)
## [1] "list"
#imp_tot2 # not printed: the output is very large
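A small sketch of what the long format contains, using the `nhanes` example data that ships with mice instead of the Titanic data (the seed and object names are arbitrary choices for illustration). It assumes the mice package is installed.

```r
library(mice)

# nhanes is a small example data set shipped with mice (25 rows)
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Long format with the original data included as .imp = 0
long <- mice::complete(imp, "long", include = TRUE)

class(long)       # a data frame, although typeof() says "list"
table(long$.imp)  # 25 rows for each of .imp = 0, 1, ..., 5

# A single completed data set, here the second one
second <- mice::complete(imp, 2)
```

Writing mice::complete() explicitly avoids a clash if another attached package also exports a function called complete(). Whether such a clash is the reason complete(imp1) failed in this notebook is not known.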
# Lattice plots -- not very informative
suppressMessages(library(lattice))
## Colour observed Age values blue and missing (imputed) ones red.
## The missingness pattern comes from the original data, imp1$data$Age; imp1$imp$Age holds only imputed values, so is.na() is never TRUE there.
## The colour vector is repeated 6 times: once for the original data (.imp = 0) plus once for each of the 5 imputations.
col_imp <- rep(c("blue", "red")[1 + as.numeric(is.na(imp1$data$Age))], 6)
## Plot Age by imputation number
stripplot(Age ~ .imp, data=imp_tot2, jitter.data=TRUE, col=col_imp, xlab="Imputation number")
# Age -- the feature whose missing values are imputed and plotted
# .imp -- the imputation number added by complete(): 0 is the original data, 1 to 5 are the imputed data sets
# data=imp_tot2 -- the long-format data frame created above
# jitter.data=TRUE -- jitter the points so that they overlap less
# col=col_imp -- colour of each point
#
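The colour vector for the strip plot has to line up with the long-format data, which stacks the original data and every imputed copy. A toy sketch of that bookkeeping (`x` and `m` are made up for illustration):

```r
# Made-up variable with two missing values
x <- c(5, NA, 7, NA)
m <- 2  # number of imputations in this toy example

# Blue for observed, red for missing, for one copy of the data
col_one <- c("blue", "red")[1 + as.numeric(is.na(x))]

# Long format stacks m + 1 copies: the original plus m imputed data sets
col_all <- rep(col_one, m + 1)
length(col_all)  # 4 * (2 + 1) = 12
```

With m = 5 imputations, as in this notebook, the repeat count is 5 + 1 = 6.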
# Convert imp1$imp$Age so that the multiply imputed values can be transferred to Python # DHANKAR -- an ad hoc process
LAge <- imp1$imp$Age
typeof(LAge)
## [1] "list"
# Method 1 (list to data frame): fails, rbindlist() gives an error - Source: SO - http://stackoverflow.com/questions/4227223/r-list-to-data-frame
#library(data.table)
#DT <- rbindlist(LAge) # fails with an error
#
# Method 2 (list to data frame): result not as desired - Source: SO - http://stackoverflow.com/questions/4227223/r-list-to-data-frame
#library(plyr)
#dfL <- ldply(LAge, data.frame)
#dfL
#
# Method 3 (list to data frame): result not as desired
#dfL1 <- do.call(rbind.data.frame, LAge)
#dfL1
#
# Method 4 (list to data frame): result as desired - Source: SO - http://stackoverflow.com/questions/4227223/r-list-to-data-frame
# Note: this drops the row names (the original row numbers 6, 18, 20, ...), so the rows of dfL2 are renumbered 1 to 177
dfL2 <- data.frame(matrix(unlist(LAge), nrow=nrow(LAge), byrow=FALSE), stringsAsFactors=FALSE)
dfL2
## X1 X2 X3 X4 X5
## 1 50.0 40.50 36.00 19.00 41.00
## 2 14.0 53.00 11.00 36.00 36.00
## 3 24.0 28.00 47.00 20.00 27.00
## 4 18.0 41.00 21.00 44.00 18.00
## 5 40.0 4.00 47.00 15.00 27.00
## 6 28.0 21.00 17.00 30.00 18.00
## 7 39.0 33.00 48.00 0.92 35.00
## 8 48.0 1.00 26.00 27.00 27.00
## 9 24.0 41.00 14.50 27.00 27.00
## 10 18.0 10.00 42.00 40.00 18.00
## 11 35.0 40.50 42.00 32.00 41.00
## 12 31.0 40.00 29.00 0.67 14.50
## 13 40.0 4.00 18.00 13.00 26.00
## 14 29.0 24.00 32.00 29.00 9.00
## 15 36.0 26.00 24.00 51.00 18.00
## 16 51.0 61.00 51.00 40.00 39.00
## 17 12.0 1.00 24.00 20.00 12.00
## 18 35.0 19.00 32.00 38.00 41.00
## 19 20.0 10.00 17.00 37.00 6.00
## 20 26.0 1.00 14.50 18.00 18.00
## 21 17.0 19.00 21.00 26.00 41.00
## 22 35.0 19.00 17.00 32.00 49.00
## 23 17.0 40.50 32.50 20.00 41.00
## 24 54.0 32.00 47.00 20.00 13.00
## 25 24.0 20.00 39.00 20.00 12.00
## 26 28.0 10.00 32.50 23.00 41.00
## 27 14.0 19.00 32.00 42.00 41.00
## 28 12.0 1.00 24.00 14.00 9.00
## 29 32.0 21.00 21.00 44.00 28.00
## 30 35.0 10.00 21.00 19.00 34.50
## 31 28.0 21.00 29.00 38.00 25.00
## 32 5.0 16.00 5.00 17.00 16.00
## 33 22.0 26.00 28.00 24.00 30.00
## 34 51.0 45.50 65.00 47.00 33.00
## 35 2.0 8.00 1.00 14.00 4.00
## 36 16.0 11.00 11.00 3.00 16.00
## 37 22.0 19.00 36.00 26.00 24.00
## 38 65.0 45.50 22.00 54.00 33.00
## 39 21.0 3.00 39.00 21.00 14.00
## 40 28.0 44.00 40.00 39.00 25.00
## 41 40.0 45.00 14.50 22.00 27.00
## 42 11.0 3.00 17.00 3.00 17.00
## 43 18.0 48.00 14.00 14.00 41.00
## 44 28.0 44.00 21.00 20.00 20.00
## 45 9.0 8.00 1.00 14.00 2.00
## 46 35.0 2.00 36.00 22.00 36.00
## 47 31.0 40.00 29.00 45.00 22.00
## 48 43.0 1.00 29.00 30.00 19.00
## 49 28.0 45.00 32.50 14.00 21.00
## 50 30.0 49.00 24.00 41.00 35.00
## 51 28.0 19.00 36.00 14.00 20.00
## 52 35.0 19.00 32.50 14.00 23.50
## 53 54.0 62.00 22.00 47.00 39.00
## 54 26.0 26.00 14.50 22.00 20.00
## 55 35.0 56.00 19.00 28.00 24.00
## 56 24.0 64.00 51.00 33.00 38.00
## 57 51.0 47.00 22.00 38.00 40.00
## 58 51.0 49.00 24.00 51.00 49.00
## 59 40.0 0.42 14.50 27.00 27.00
## 60 10.0 0.75 8.00 9.00 0.75
## 61 25.0 36.00 25.00 27.00 42.00
## 62 20.0 21.00 55.00 20.00 22.00
## 63 23.0 48.00 30.00 58.00 28.00
## 64 3.0 17.00 5.00 3.00 16.00
## 65 10.0 38.00 3.00 4.00 0.75
## 66 23.0 36.50 22.00 35.00 31.00
## 67 20.0 2.00 35.00 14.00 17.00
## 68 24.0 4.00 24.00 20.00 4.00
## 69 54.0 47.00 65.00 62.00 62.00
## 70 17.0 45.00 40.00 28.00 28.00
## 71 26.0 26.00 41.00 21.00 23.00
## 72 24.0 26.00 14.50 29.00 25.00
## 73 14.5 29.00 14.00 17.00 18.00
## 74 54.0 40.00 14.50 29.00 5.00
## 75 40.0 32.00 14.50 15.00 22.00
## 76 35.0 17.00 54.00 38.00 54.00
## 77 35.0 45.00 35.00 17.00 20.00
## 78 20.0 21.00 55.00 18.00 14.00
## 79 2.0 35.00 35.00 24.00 38.00
## 80 17.0 42.00 40.00 50.00 28.00
## 81 35.0 42.00 19.00 35.00 20.00
## 82 56.0 46.00 22.00 24.00 23.00
## 83 17.0 21.00 40.00 28.00 17.00
## 84 35.0 45.00 35.00 17.00 28.00
## 85 20.0 21.00 34.00 50.00 24.00
## 86 35.0 42.00 55.00 24.00 28.00
## 87 24.0 14.00 4.00 25.00 1.00
## 88 40.0 30.00 31.00 29.00 27.00
## 89 31.0 22.00 14.00 20.00 24.00
## 90 20.0 32.00 35.00 50.00 24.00
## 91 27.0 17.00 54.00 48.00 54.00
## 92 28.0 45.00 55.00 18.00 24.00
## 93 14.0 32.00 40.00 18.00 7.00
## 94 36.0 22.00 22.00 24.00 19.00
## 95 21.0 42.00 19.00 7.00 24.00
## 96 14.0 32.00 19.00 50.00 50.00
## 97 51.0 38.00 54.00 24.00 31.00
## 98 56.0 46.00 36.00 24.00 54.00
## 99 2.0 31.00 1.00 9.00 0.75
## 100 41.0 22.00 0.67 30.00 54.00
## 101 17.0 24.00 35.00 0.83 7.00
## 102 14.0 28.00 40.00 18.00 24.00
## 103 35.0 32.00 34.00 7.00 50.00
## 104 51.0 18.00 24.00 51.00 19.00
## 105 20.0 32.00 19.00 24.00 1.00
## 106 50.0 28.00 35.00 44.00 23.00
## 107 35.0 28.00 55.00 50.00 50.00
## 108 22.0 28.00 55.00 33.00 50.00
## 109 27.0 29.00 22.00 51.00 54.00
## 110 14.0 32.00 40.00 18.00 50.00
## 111 32.0 29.00 54.00 39.00 16.00
## 112 17.0 28.00 34.00 0.83 23.00
## 113 14.0 42.00 18.00 28.00 42.00
## 114 35.0 18.00 40.00 7.00 1.00
## 115 24.0 62.00 22.00 46.00 22.00
## 116 20.0 8.00 35.00 7.00 1.00
## 117 14.0 8.00 55.00 18.00 7.00
## 118 14.0 1.00 55.00 50.00 23.00
## 119 22.0 8.00 34.00 24.00 7.00
## 120 48.0 29.00 26.00 26.00 26.00
## 121 14.5 17.00 29.00 24.00 24.00
## 122 14.0 8.00 19.00 24.00 1.00
## 123 14.0 7.00 55.00 33.00 23.00
## 124 41.0 8.00 17.00 19.00 19.00
## 125 36.0 35.00 25.00 41.00 34.00
## 126 24.0 1.00 55.00 50.00 23.00
## 127 33.0 33.00 40.00 50.00 8.00
## 128 22.0 47.00 51.00 46.00 28.00
## 129 28.0 24.00 35.00 50.00 8.00
## 130 24.0 19.00 9.00 15.00 1.00
## 131 29.0 7.00 19.00 7.00 28.00
## 132 20.0 24.00 55.00 24.00 23.00
## 133 37.0 40.00 22.00 38.00 60.00
## 134 18.0 18.00 0.67 14.50 40.00
## 135 26.0 29.00 31.00 24.00 28.00
## 136 59.0 33.00 34.00 7.00 32.00
## 137 38.0 33.00 35.00 50.00 18.00
## 138 48.0 39.00 41.00 0.42 18.00
## 139 28.0 1.00 55.00 1.00 28.00
## 140 33.0 33.00 40.00 7.00 42.00
## 141 64.0 29.00 63.00 27.00 35.00
## 142 23.0 16.00 19.00 21.00 56.00
## 143 17.0 50.00 55.00 7.00 45.00
## 144 15.0 16.00 26.00 24.00 37.00
## 145 24.0 18.00 33.00 40.00 32.00
## 146 15.0 20.00 9.00 4.00 3.00
## 147 51.0 33.00 22.00 62.00 71.00
## 148 14.0 33.00 35.00 42.00 21.00
## 149 40.0 48.00 31.00 26.00 18.00
## 150 16.0 2.00 22.00 33.00 35.00
## 151 36.0 24.00 34.00 7.00 45.00
## 152 19.0 50.00 55.00 23.00 45.00
## 153 48.0 18.00 27.00 28.00 44.00
## 154 21.0 18.00 34.00 21.00 44.00
## 155 37.0 39.00 65.00 54.00 46.00
## 156 31.0 40.00 29.00 22.00 15.00
## 157 25.0 17.00 34.00 1.00 44.00
## 158 45.5 28.00 35.00 18.00 21.00
## 159 45.5 35.00 35.00 18.00 21.00
## 160 27.0 39.00 5.00 48.00 39.00
## 161 24.0 28.00 53.00 1.00 42.00
## 162 16.0 3.00 17.00 3.00 16.00
## 163 24.0 39.00 65.00 61.00 37.00
## 164 24.0 38.00 22.00 39.00 31.00
## 165 27.0 20.00 34.00 8.00 45.00
## 166 32.0 50.00 55.00 36.00 19.00
## 167 54.0 20.00 31.00 40.00 4.00
## 168 42.0 20.00 34.00 8.00 21.00
## 169 22.0 17.00 19.00 32.00 19.00
## 170 19.0 50.00 25.00 44.00 30.00
## 171 11.0 5.00 5.00 16.00 16.00
## 172 48.0 21.00 38.00 26.00 31.00
## 173 22.0 35.00 53.00 23.00 21.00
## 174 16.0 5.00 3.00 3.00 17.00
## 175 22.0 20.00 55.00 2.00 44.00
## 176 22.0 14.00 55.00 28.00 21.00
## 177 15.0 26.00 5.00 48.00 26.00
write.csv(file="Age_IMP.csv", x=dfL2)
# Go to Python with this CSV and fill in the blanks for Age
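Because the exported table is numbered 1 to 177, the link back to the original rows (6, 18, 20, ...) is not in the CSV. One possible alternative, sketched here on a made-up table (`toy_imp`, `out` and the temporary file are hypothetical), is to store the original row number as an explicit column before writing:

```r
# Made-up imputation table shaped like imp1$imp$Age
toy_imp <- data.frame("1" = c(50, 14, 24), "2" = c(40.5, 53, 28),
                      row.names = c("6", "18", "20"), check.names = FALSE)

# Add the original row number as an explicit column before export
out <- cbind(row = as.integer(rownames(toy_imp)), toy_imp)

csv_path <- tempfile(fileext = ".csv")  # temporary file, for illustration only
write.csv(out, csv_path, row.names = FALSE)

# Reading it back shows that the row numbers survive the round trip
back <- read.csv(csv_path, check.names = FALSE)
back$row
```

Any tool reading the file, Python included, can then use the `row` column to place each imputed value in the right observation.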
# Tangential analysis -- inspecting the built-in Titanic data set that ships with R (datasets package)
#
dftt<-as.data.frame(Titanic)
summary(dftt)
## Class Sex Age Survived Freq
## 1st :8 Male :16 Child:16 No :16 Min. : 0.00
## 2nd :8 Female:16 Adult:16 Yes:16 1st Qu.: 0.75
## 3rd :8 Median : 13.50
## Crew:8 Mean : 68.78
## 3rd Qu.: 77.00
## Max. :670.00
write.csv(file="Titanic_IMP.csv", x=dftt)
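The built-in `Titanic` object is a table of counts, not passenger-level data like the Kaggle CSV files, and it includes the crew. A short sketch of how the Freq column can be read (only base R is used):

```r
dftt <- as.data.frame(Titanic)

# Each of the 32 rows is one Class x Sex x Age x Survived combination;
# Freq is the number of people in that combination
nrow(dftt)
sum(dftt$Freq)  # total number of people covered by this data set

# Head counts by survival, then by class and survival
aggregate(Freq ~ Survived, data = dftt, FUN = sum)
xtabs(Freq ~ Class + Survived, data = dftt)
```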
# Back to Megan Risdal's code
# Passengers 62 and 830 are missing Embarked (port of embarkation)
full[c(62, 830), 'Embarked']
## [1] "" ""
# Drop the two passengers with missing Embarked
embark_fare <- full %>%
  filter(PassengerId != 62 & PassengerId != 830)
# Use ggplot2 to visualize embarkment, passenger class, & median fare
ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept=80),
    colour='red', linetype='dashed', lwd=2) +
  scale_y_continuous(labels=dollar_format()) +
  theme_few()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
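The centre line of each box in such a plot is the median fare of one Embarked and Pclass group. As a rough sketch, the same medians can be computed as numbers with aggregate(). The data frame `toy` below is entirely made up and only mimics the column names; it says nothing about the real fares.

```r
# Made-up fares, for illustration only
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S"),
  Pclass   = c(1, 1, 3, 1, 1, 3),
  Fare     = c(40, 60, 8, 50, 54, 7)
)

# Median fare for every Embarked x Pclass group
med <- aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)
med
```

Run on `embark_fare`, the same call would give the medians that the boxplot draws, which can then be compared with the reference line.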
### Margin plot of Age against PassengerId (columns swapped) - nothing much to see
#marginplot(train[c(6,1)], col=c("blue", "red", "orange"))
# The blanks in Cabin do not show up as missing here: they were read in as empty strings (""), not as NA
# marginplot(train[c(1,11)], col=c("blue", "red", "orange")) # Cabin blanks still not shown, for the same reason
#
pbox(train, pos=1) # pos=1 selects column 1, PassengerId
## Warning in createPlot(main, sub, xlab, ylab, labels, ca$at): not enough
## space to display frequencies
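The Cabin blanks mentioned above stay invisible to missing-value tools because read.csv() keeps empty text fields as "" by default. A minimal sketch with an inline three-row CSV (the text in `csv_text` is a made-up fragment that only mimics the file layout):

```r
csv_text <- "PassengerId,Cabin\n1,\n2,C85\n3,\n"

# Default: empty character fields stay as "", so nothing counts as missing
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sum(is.na(raw$Cabin))

# Treat empty strings as NA while reading
with_na <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))
sum(is.na(with_na$Cabin))
```

Reading train.csv with the same na.strings setting would make the empty Cabin and Embarked fields count as missing. It would also change what mice() sees, so the results shown in this notebook would no longer apply as printed.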
## mice package: by default, mice() creates m = 5 sets of imputed values for every missing value
imp1 <-mice(train, m=5)
##
## iter imp variable
## 1 1 Age
## 1 2 Age
## 1 3 Age
## 1 4 Age
## 1 5 Age
## 2 1 Age
## 2 2 Age
## 2 3 Age
## 2 4 Age
## 2 5 Age
## 3 1 Age
## 3 2 Age
## 3 3 Age
## 3 4 Age
## 3 5 Age
## 4 1 Age
## 4 2 Age
## 4 3 Age
## 4 4 Age
## 4 5 Age
## 5 1 Age
## 5 2 Age
## 5 3 Age
## 5 4 Age
## 5 5 Age
imp1
## Multiply imputed data set
## Call:
## mice(data = train, m = 5)
## Number of multiple imputations: 5
## Missing cells per column:
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
## Imputation methods:
## PassengerId Survived Pclass Name Sex Age
## "" "" "" "" "" "pmm"
## SibSp Parch Ticket Fare Cabin Embarked
## "" "" "" "" "" ""
## VisitSequence:
## Age
## 6
## PredictorMatrix:
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
## PassengerId 0 0 0 0 0 0 0 0 0
## Survived 0 0 0 0 0 0 0 0 0
## Pclass 0 0 0 0 0 0 0 0 0
## Name 0 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0 0
## Age 1 1 1 0 0 0 1 1 0
## SibSp 0 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0 0
## Ticket 0 0 0 0 0 0 0 0 0
## Fare 0 0 0 0 0 0 0 0 0
## Cabin 0 0 0 0 0 0 0 0 0
## Embarked 0 0 0 0 0 0 0 0 0
## Fare Cabin Embarked
## PassengerId 0 0 0
## Survived 0 0 0
## Pclass 0 0 0
## Name 0 0 0
## Sex 0 0 0
## Age 1 0 0
## SibSp 0 0 0
## Parch 0 0 0
## Ticket 0 0 0
## Fare 0 0 0
## Cabin 0 0 0
## Embarked 0 0 0
## Random generator seed value: NA
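The printout reports a seed of NA, which means a rerun of mice() will generally give different imputed values. A sketch of how a seed makes the imputations repeatable, using the `nhanes` example data shipped with mice (the seed value 2017 is arbitrary; the mice package is assumed to be installed):

```r
library(mice)

# Same seed, same imputations
imp_a <- mice(nhanes, m = 5, seed = 2017, printFlag = FALSE)
imp_b <- mice(nhanes, m = 5, seed = 2017, printFlag = FALSE)
identical(imp_a$imp$bmi, imp_b$imp$bmi)

# The seed is stored in the result
imp_a$seed
```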
#
## The predictor matrix shows which variables are used to predict each incomplete variable: a 1 in the Age row marks a column that is used as a predictor of Age.
## Here Age is predicted from PassengerId, Survived, Pclass, SibSp, Parch and Fare.
## "Imputation methods" is "" for the complete variables and "pmm" for Age: "pmm" (predictive mean matching) is the method mice uses to impute Age.
## DHANKAR - pending question: how to change this set of predictors and use other variables to predict the missing values
##
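One way to approach the pending question is the predictorMatrix argument of mice(). The sketch below uses the `nhanes` example data shipped with mice rather than the Titanic data, and the choice of which predictor to switch off is arbitrary; it assumes the mice package is installed.

```r
library(mice)

# Dry run (maxit = 0) only sets up the defaults, including the predictor matrix
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Row = variable being imputed, column = candidate predictor.
# Stop using hyp as a predictor of bmi:
pred["bmi", "hyp"] <- 0

imp <- mice(nhanes, m = 5, predictorMatrix = pred, seed = 1, printFlag = FALSE)
imp$predictorMatrix["bmi", ]
```

For the Titanic data the same pattern would edit the Age row, for example setting the PassengerId entry to 0. Name, Sex, Ticket, Cabin and Embarked have 0 in the Age row above, probably because they were read as character columns; converting a column such as Sex to a factor before calling mice() is one thing to try, though that is an assumption and has not been checked here.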
imp1$imp$Age
## 1 2 3 4 5
## 6 50.0 40.50 36.00 19.00 41.00
## 18 14.0 53.00 11.00 36.00 36.00
## 20 24.0 28.00 47.00 20.00 27.00
## 27 18.0 41.00 21.00 44.00 18.00
## 29 40.0 4.00 47.00 15.00 27.00
## 30 28.0 21.00 17.00 30.00 18.00
## 32 39.0 33.00 48.00 0.92 35.00
## 33 48.0 1.00 26.00 27.00 27.00
## 37 24.0 41.00 14.50 27.00 27.00
## 43 18.0 10.00 42.00 40.00 18.00
## 46 35.0 40.50 42.00 32.00 41.00
## 47 31.0 40.00 29.00 0.67 14.50
## 48 40.0 4.00 18.00 13.00 26.00
## 49 29.0 24.00 32.00 29.00 9.00
## 56 36.0 26.00 24.00 51.00 18.00
## 65 51.0 61.00 51.00 40.00 39.00
## 66 12.0 1.00 24.00 20.00 12.00
## 77 35.0 19.00 32.00 38.00 41.00
## 78 20.0 10.00 17.00 37.00 6.00
## 83 26.0 1.00 14.50 18.00 18.00
## 88 17.0 19.00 21.00 26.00 41.00
## 96 35.0 19.00 17.00 32.00 49.00
## 102 17.0 40.50 32.50 20.00 41.00
## 108 54.0 32.00 47.00 20.00 13.00
## 110 24.0 20.00 39.00 20.00 12.00
## 122 28.0 10.00 32.50 23.00 41.00
## 127 14.0 19.00 32.00 42.00 41.00
## 129 12.0 1.00 24.00 14.00 9.00
## 141 32.0 21.00 21.00 44.00 28.00
## 155 35.0 10.00 21.00 19.00 34.50
## 159 28.0 21.00 29.00 38.00 25.00
## 160 5.0 16.00 5.00 17.00 16.00
## 167 22.0 26.00 28.00 24.00 30.00
## 169 51.0 45.50 65.00 47.00 33.00
## 177 2.0 8.00 1.00 14.00 4.00
## 181 16.0 11.00 11.00 3.00 16.00
## 182 22.0 19.00 36.00 26.00 24.00
## 186 65.0 45.50 22.00 54.00 33.00
## 187 21.0 3.00 39.00 21.00 14.00
## 197 28.0 44.00 40.00 39.00 25.00
## 199 40.0 45.00 14.50 22.00 27.00
## 202 11.0 3.00 17.00 3.00 17.00
## 215 18.0 48.00 14.00 14.00 41.00
## 224 28.0 44.00 21.00 20.00 20.00
## 230 9.0 8.00 1.00 14.00 2.00
## 236 35.0 2.00 36.00 22.00 36.00
## 241 31.0 40.00 29.00 45.00 22.00
## 242 43.0 1.00 29.00 30.00 19.00
## 251 28.0 45.00 32.50 14.00 21.00
## 257 30.0 49.00 24.00 41.00 35.00
## 261 28.0 19.00 36.00 14.00 20.00
## 265 35.0 19.00 32.50 14.00 23.50
## 271 54.0 62.00 22.00 47.00 39.00
## 275 26.0 26.00 14.50 22.00 20.00
## 278 35.0 56.00 19.00 28.00 24.00
## 285 24.0 64.00 51.00 33.00 38.00
## 296 51.0 47.00 22.00 38.00 40.00
## 299 51.0 49.00 24.00 51.00 49.00
## 301 40.0 0.42 14.50 27.00 27.00
## 302 10.0 0.75 8.00 9.00 0.75
## 304 25.0 36.00 25.00 27.00 42.00
## 305 20.0 21.00 55.00 20.00 22.00
## 307 23.0 48.00 30.00 58.00 28.00
## 325 3.0 17.00 5.00 3.00 16.00
## 331 10.0 38.00 3.00 4.00 0.75
## 335 23.0 36.50 22.00 35.00 31.00
## 336 20.0 2.00 35.00 14.00 17.00
## 348 24.0 4.00 24.00 20.00 4.00
## 352 54.0 47.00 65.00 62.00 62.00
## 355 17.0 45.00 40.00 28.00 28.00
## 359 26.0 26.00 41.00 21.00 23.00
## 360 24.0 26.00 14.50 29.00 25.00
## 365 14.5 29.00 14.00 17.00 18.00
## 368 54.0 40.00 14.50 29.00 5.00
## 369 40.0 32.00 14.50 15.00 22.00
## 376 35.0 17.00 54.00 38.00 54.00
## 385 35.0 45.00 35.00 17.00 20.00
## 389 20.0 21.00 55.00 18.00 14.00
## 410 2.0 35.00 35.00 24.00 38.00
## 411 17.0 42.00 40.00 50.00 28.00
## 412 35.0 42.00 19.00 35.00 20.00
## 414 56.0 46.00 22.00 24.00 23.00
## 416 17.0 21.00 40.00 28.00 17.00
## 421 35.0 45.00 35.00 17.00 28.00
## 426 20.0 21.00 34.00 50.00 24.00
## 429 35.0 42.00 55.00 24.00 28.00
## 432 24.0 14.00 4.00 25.00 1.00
## 445 40.0 30.00 31.00 29.00 27.00
## 452 31.0 22.00 14.00 20.00 24.00
## 455 20.0 32.00 35.00 50.00 24.00
## 458 27.0 17.00 54.00 48.00 54.00
## 460 28.0 45.00 55.00 18.00 24.00
## 465 14.0 32.00 40.00 18.00 7.00
## 467 36.0 22.00 22.00 24.00 19.00
## 469 21.0 42.00 19.00 7.00 24.00
## 471 14.0 32.00 19.00 50.00 50.00
## 476 51.0 38.00 54.00 24.00 31.00
## 482 56.0 46.00 36.00 24.00 54.00
## 486 2.0 31.00 1.00 9.00 0.75
## 491 41.0 22.00 0.67 30.00 54.00
## 496 17.0 24.00 35.00 0.83 7.00
## 498 14.0 28.00 40.00 18.00 24.00
## 503 35.0 32.00 34.00 7.00 50.00
## 508 51.0 18.00 24.00 51.00 19.00
## 512 20.0 32.00 19.00 24.00 1.00
## 518 50.0 28.00 35.00 44.00 23.00
## 523 35.0 28.00 55.00 50.00 50.00
## 525 22.0 28.00 55.00 33.00 50.00
## 528 27.0 29.00 22.00 51.00 54.00
## 532 14.0 32.00 40.00 18.00 50.00
## 534 32.0 29.00 54.00 39.00 16.00
## 539 17.0 28.00 34.00 0.83 23.00
## 548 14.0 42.00 18.00 28.00 42.00
## 553 35.0 18.00 40.00 7.00 1.00
## 558 24.0 62.00 22.00 46.00 22.00
## 561 20.0 8.00 35.00 7.00 1.00
## 564 14.0 8.00 55.00 18.00 7.00
## 565 14.0 1.00 55.00 50.00 23.00
## 569 22.0 8.00 34.00 24.00 7.00
## 574 48.0 29.00 26.00 26.00 26.00
## 579 14.5 17.00 29.00 24.00 24.00
## 585 14.0 8.00 19.00 24.00 1.00
## 590 14.0 7.00 55.00 33.00 23.00
## 594 41.0 8.00 17.00 19.00 19.00
## 597 36.0 35.00 25.00 41.00 34.00
## 599 24.0 1.00 55.00 50.00 23.00
## 602 33.0 33.00 40.00 50.00 8.00
## 603 22.0 47.00 51.00 46.00 28.00
## 612 28.0 24.00 35.00 50.00 8.00
## 613 24.0 19.00 9.00 15.00 1.00
## 614 29.0 7.00 19.00 7.00 28.00
## 630 20.0 24.00 55.00 24.00 23.00
## 634 37.0 40.00 22.00 38.00 60.00
## 640 18.0 18.00 0.67 14.50 40.00
## 644 26.0 29.00 31.00 24.00 28.00
## 649 59.0 33.00 34.00 7.00 32.00
## 651 38.0 33.00 35.00 50.00 18.00
## 654 48.0 39.00 41.00 0.42 18.00
## 657 28.0 1.00 55.00 1.00 28.00
## 668 33.0 33.00 40.00 7.00 42.00
## 670 64.0 29.00 63.00 27.00 35.00
## 675 23.0 16.00 19.00 21.00 56.00
## 681 17.0 50.00 55.00 7.00 45.00
## 693 15.0 16.00 26.00 24.00 37.00
## 698 24.0 18.00 33.00 40.00 32.00
## 710 15.0 20.00 9.00 4.00 3.00
## 712 51.0 33.00 22.00 62.00 71.00
## 719 14.0 33.00 35.00 42.00 21.00
## 728 40.0 48.00 31.00 26.00 18.00
## 733 16.0 2.00 22.00 33.00 35.00
## 739 36.0 24.00 34.00 7.00 45.00
## 740 19.0 50.00 55.00 23.00 45.00
## 741 48.0 18.00 27.00 28.00 44.00
## 761 21.0 18.00 34.00 21.00 44.00
## 767 37.0 39.00 65.00 54.00 46.00
## 769 31.0 40.00 29.00 22.00 15.00
## 774 25.0 17.00 34.00 1.00 44.00
## 777 45.5 28.00 35.00 18.00 21.00
## 779 45.5 35.00 35.00 18.00 21.00
## 784 27.0 39.00 5.00 48.00 39.00
## 791 24.0 28.00 53.00 1.00 42.00
## 793 16.0 3.00 17.00 3.00 16.00
## 794 24.0 39.00 65.00 61.00 37.00
## 816 24.0 38.00 22.00 39.00 31.00
## 826 27.0 20.00 34.00 8.00 45.00
## 827 32.0 50.00 55.00 36.00 19.00
## 829 54.0 20.00 31.00 40.00 4.00
## 833 42.0 20.00 34.00 8.00 21.00
## 838 22.0 17.00 19.00 32.00 19.00
## 840 19.0 50.00 25.00 44.00 30.00
## 847 11.0 5.00 5.00 16.00 16.00
## 850 48.0 21.00 38.00 26.00 31.00
## 860 22.0 35.00 53.00 23.00 21.00
## 864 16.0 5.00 3.00 3.00 17.00
## 869 22.0 20.00 55.00 2.00 44.00
## 879 22.0 14.00 55.00 28.00 21.00
## 889 15.0 26.00 5.00 48.00 26.00
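Every value in the table above is an age that also occurs among the observed passengers (for example 0.42, 14.5 or 45.5), because predictive mean matching borrows observed values from "donor" cases. The following is a deliberately simplified base R sketch of that idea on made-up data. It is not the mice implementation, which does more, for instance adding randomness to the regression coefficients; the names `x`, `y`, `k` and `y_imp` are hypothetical.

```r
set.seed(1)

# Made-up data: y depends on x, and three y values are missing
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, NA, 6.2, 8.1, NA, 12.3, NA, 16.0)

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

k <- 3  # number of candidate donors per missing value
y_imp <- y
for (i in which(!obs)) {
  # Observed cases whose predicted value is closest to the prediction for case i
  dist   <- abs(pred[obs] - pred[i])
  donors <- y[obs][order(dist)[seq_len(k)]]
  # Borrow the observed y of one randomly chosen donor
  y_imp[i] <- donors[sample.int(k, 1)]
}
y_imp
```

Since the filled-in values are always copied from observed cases, the method cannot produce impossible values such as a negative age.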
#
imp1$imp
## $PassengerId
## NULL
##
## $Survived
## NULL
##
## $Pclass
## NULL
##
## $Name
## NULL
##
## $Sex
## NULL
##
## $Age
## 1 2 3 4 5
## 6 50.0 40.50 36.00 19.00 41.00
## 18 14.0 53.00 11.00 36.00 36.00
## 20 24.0 28.00 47.00 20.00 27.00
## 27 18.0 41.00 21.00 44.00 18.00
## 29 40.0 4.00 47.00 15.00 27.00
## 30 28.0 21.00 17.00 30.00 18.00
## 32 39.0 33.00 48.00 0.92 35.00
## 33 48.0 1.00 26.00 27.00 27.00
## 37 24.0 41.00 14.50 27.00 27.00
## 43 18.0 10.00 42.00 40.00 18.00
## 46 35.0 40.50 42.00 32.00 41.00
## 47 31.0 40.00 29.00 0.67 14.50
## 48 40.0 4.00 18.00 13.00 26.00
## 49 29.0 24.00 32.00 29.00 9.00
## 56 36.0 26.00 24.00 51.00 18.00
## 65 51.0 61.00 51.00 40.00 39.00
## 66 12.0 1.00 24.00 20.00 12.00
## 77 35.0 19.00 32.00 38.00 41.00
## 78 20.0 10.00 17.00 37.00 6.00
## 83 26.0 1.00 14.50 18.00 18.00
## 88 17.0 19.00 21.00 26.00 41.00
## 96 35.0 19.00 17.00 32.00 49.00
## 102 17.0 40.50 32.50 20.00 41.00
## 108 54.0 32.00 47.00 20.00 13.00
## 110 24.0 20.00 39.00 20.00 12.00
## 122 28.0 10.00 32.50 23.00 41.00
## 127 14.0 19.00 32.00 42.00 41.00
## 129 12.0 1.00 24.00 14.00 9.00
## 141 32.0 21.00 21.00 44.00 28.00
## 155 35.0 10.00 21.00 19.00 34.50
## 159 28.0 21.00 29.00 38.00 25.00
## 160 5.0 16.00 5.00 17.00 16.00
## 167 22.0 26.00 28.00 24.00 30.00
## 169 51.0 45.50 65.00 47.00 33.00
## 177 2.0 8.00 1.00 14.00 4.00
## 181 16.0 11.00 11.00 3.00 16.00
## 182 22.0 19.00 36.00 26.00 24.00
## 186 65.0 45.50 22.00 54.00 33.00
## 187 21.0 3.00 39.00 21.00 14.00
## 197 28.0 44.00 40.00 39.00 25.00
## 199 40.0 45.00 14.50 22.00 27.00
## 202 11.0 3.00 17.00 3.00 17.00
## 215 18.0 48.00 14.00 14.00 41.00
## 224 28.0 44.00 21.00 20.00 20.00
## 230 9.0 8.00 1.00 14.00 2.00
## 236 35.0 2.00 36.00 22.00 36.00
## 241 31.0 40.00 29.00 45.00 22.00
## 242 43.0 1.00 29.00 30.00 19.00
## 251 28.0 45.00 32.50 14.00 21.00
## 257 30.0 49.00 24.00 41.00 35.00
## 261 28.0 19.00 36.00 14.00 20.00
## 265 35.0 19.00 32.50 14.00 23.50
## 271 54.0 62.00 22.00 47.00 39.00
## 275 26.0 26.00 14.50 22.00 20.00
## 278 35.0 56.00 19.00 28.00 24.00
## 285 24.0 64.00 51.00 33.00 38.00
## 296 51.0 47.00 22.00 38.00 40.00
## 299 51.0 49.00 24.00 51.00 49.00
## 301 40.0 0.42 14.50 27.00 27.00
## 302 10.0 0.75 8.00 9.00 0.75
## 304 25.0 36.00 25.00 27.00 42.00
## 305 20.0 21.00 55.00 20.00 22.00
## 307 23.0 48.00 30.00 58.00 28.00
## 325 3.0 17.00 5.00 3.00 16.00
## 331 10.0 38.00 3.00 4.00 0.75
## 335 23.0 36.50 22.00 35.00 31.00
## 336 20.0 2.00 35.00 14.00 17.00
## 348 24.0 4.00 24.00 20.00 4.00
## 352 54.0 47.00 65.00 62.00 62.00
## 355 17.0 45.00 40.00 28.00 28.00
## 359 26.0 26.00 41.00 21.00 23.00
## 360 24.0 26.00 14.50 29.00 25.00
## 365 14.5 29.00 14.00 17.00 18.00
## 368 54.0 40.00 14.50 29.00 5.00
## 369 40.0 32.00 14.50 15.00 22.00
## 376 35.0 17.00 54.00 38.00 54.00
## 385 35.0 45.00 35.00 17.00 20.00
## 389 20.0 21.00 55.00 18.00 14.00
## 410 2.0 35.00 35.00 24.00 38.00
## 411 17.0 42.00 40.00 50.00 28.00
## 412 35.0 42.00 19.00 35.00 20.00
## 414 56.0 46.00 22.00 24.00 23.00
## 416 17.0 21.00 40.00 28.00 17.00
## 421 35.0 45.00 35.00 17.00 28.00
## 426 20.0 21.00 34.00 50.00 24.00
## 429 35.0 42.00 55.00 24.00 28.00
## 432 24.0 14.00 4.00 25.00 1.00
## 445 40.0 30.00 31.00 29.00 27.00
## 452 31.0 22.00 14.00 20.00 24.00
## 455 20.0 32.00 35.00 50.00 24.00
## 458 27.0 17.00 54.00 48.00 54.00
## 460 28.0 45.00 55.00 18.00 24.00
## 465 14.0 32.00 40.00 18.00 7.00
## 467 36.0 22.00 22.00 24.00 19.00
## 469 21.0 42.00 19.00 7.00 24.00
## 471 14.0 32.00 19.00 50.00 50.00
## 476 51.0 38.00 54.00 24.00 31.00
## 482 56.0 46.00 36.00 24.00 54.00
## 486 2.0 31.00 1.00 9.00 0.75
## 491 41.0 22.00 0.67 30.00 54.00
## 496 17.0 24.00 35.00 0.83 7.00
## 498 14.0 28.00 40.00 18.00 24.00
## 503 35.0 32.00 34.00 7.00 50.00
## 508 51.0 18.00 24.00 51.00 19.00
## 512 20.0 32.00 19.00 24.00 1.00
## 518 50.0 28.00 35.00 44.00 23.00
## 523 35.0 28.00 55.00 50.00 50.00
## 525 22.0 28.00 55.00 33.00 50.00
## 528 27.0 29.00 22.00 51.00 54.00
## 532 14.0 32.00 40.00 18.00 50.00
## 534 32.0 29.00 54.00 39.00 16.00
## 539 17.0 28.00 34.00 0.83 23.00
## 548 14.0 42.00 18.00 28.00 42.00
## 553 35.0 18.00 40.00 7.00 1.00
## 558 24.0 62.00 22.00 46.00 22.00
## 561 20.0 8.00 35.00 7.00 1.00
## 564 14.0 8.00 55.00 18.00 7.00
## 565 14.0 1.00 55.00 50.00 23.00
## 569 22.0 8.00 34.00 24.00 7.00
## 574 48.0 29.00 26.00 26.00 26.00
## 579 14.5 17.00 29.00 24.00 24.00
## 585 14.0 8.00 19.00 24.00 1.00
## 590 14.0 7.00 55.00 33.00 23.00
## 594 41.0 8.00 17.00 19.00 19.00
## 597 36.0 35.00 25.00 41.00 34.00
## 599 24.0 1.00 55.00 50.00 23.00
## 602 33.0 33.00 40.00 50.00 8.00
## 603 22.0 47.00 51.00 46.00 28.00
## 612 28.0 24.00 35.00 50.00 8.00
## 613 24.0 19.00 9.00 15.00 1.00
## 614 29.0 7.00 19.00 7.00 28.00
## 630 20.0 24.00 55.00 24.00 23.00
## 634 37.0 40.00 22.00 38.00 60.00
## 640 18.0 18.00 0.67 14.50 40.00
## 644 26.0 29.00 31.00 24.00 28.00
## 649 59.0 33.00 34.00 7.00 32.00
## 651 38.0 33.00 35.00 50.00 18.00
## 654 48.0 39.00 41.00 0.42 18.00
## 657 28.0 1.00 55.00 1.00 28.00
## 668 33.0 33.00 40.00 7.00 42.00
## 670 64.0 29.00 63.00 27.00 35.00
## 675 23.0 16.00 19.00 21.00 56.00
## 681 17.0 50.00 55.00 7.00 45.00
## 693 15.0 16.00 26.00 24.00 37.00
## 698 24.0 18.00 33.00 40.00 32.00
## 710 15.0 20.00 9.00 4.00 3.00
## 712 51.0 33.00 22.00 62.00 71.00
## 719 14.0 33.00 35.00 42.00 21.00
## 728 40.0 48.00 31.00 26.00 18.00
## 733 16.0 2.00 22.00 33.00 35.00
## 739 36.0 24.00 34.00 7.00 45.00
## 740 19.0 50.00 55.00 23.00 45.00
## 741 48.0 18.00 27.00 28.00 44.00
## 761 21.0 18.00 34.00 21.00 44.00
## 767 37.0 39.00 65.00 54.00 46.00
## 769 31.0 40.00 29.00 22.00 15.00
## 774 25.0 17.00 34.00 1.00 44.00
## 777 45.5 28.00 35.00 18.00 21.00
## 779 45.5 35.00 35.00 18.00 21.00
## 784 27.0 39.00 5.00 48.00 39.00
## 791 24.0 28.00 53.00 1.00 42.00
## 793 16.0 3.00 17.00 3.00 16.00
## 794 24.0 39.00 65.00 61.00 37.00
## 816 24.0 38.00 22.00 39.00 31.00
## 826 27.0 20.00 34.00 8.00 45.00
## 827 32.0 50.00 55.00 36.00 19.00
## 829 54.0 20.00 31.00 40.00 4.00
## 833 42.0 20.00 34.00 8.00 21.00
## 838 22.0 17.00 19.00 32.00 19.00
## 840 19.0 50.00 25.00 44.00 30.00
## 847 11.0 5.00 5.00 16.00 16.00
## 850 48.0 21.00 38.00 26.00 31.00
## 860 22.0 35.00 53.00 23.00 21.00
## 864 16.0 5.00 3.00 3.00 17.00
## 869 22.0 20.00 55.00 2.00 44.00
## 879 22.0 14.00 55.00 28.00 21.00
## 889 15.0 26.00 5.00 48.00 26.00
##
## $SibSp
## NULL
##
## $Parch
## NULL
##
## $Ticket
## NULL
##
## $Fare
## NULL
##
## $Cabin
## NULL
##
## $Embarked
## NULL
dfL2 <- data.frame(matrix(unlist(LAge), nrow=177, byrow=F),stringsAsFactors=FALSE)
dfL2
## X1 X2 X3 X4 X5
## 1 50.0 40.50 36.00 19.00 41.00
## 2 14.0 53.00 11.00 36.00 36.00
## 3 24.0 28.00 47.00 20.00 27.00
## 4 18.0 41.00 21.00 44.00 18.00
## 5 40.0 4.00 47.00 15.00 27.00
## 6 28.0 21.00 17.00 30.00 18.00
## 7 39.0 33.00 48.00 0.92 35.00
## 8 48.0 1.00 26.00 27.00 27.00
## 9 24.0 41.00 14.50 27.00 27.00
## 10 18.0 10.00 42.00 40.00 18.00
## 11 35.0 40.50 42.00 32.00 41.00
## 12 31.0 40.00 29.00 0.67 14.50
## 13 40.0 4.00 18.00 13.00 26.00
## 14 29.0 24.00 32.00 29.00 9.00
## 15 36.0 26.00 24.00 51.00 18.00
## 16 51.0 61.00 51.00 40.00 39.00
## 17 12.0 1.00 24.00 20.00 12.00
## 18 35.0 19.00 32.00 38.00 41.00
## 19 20.0 10.00 17.00 37.00 6.00
## 20 26.0 1.00 14.50 18.00 18.00
## 21 17.0 19.00 21.00 26.00 41.00
## 22 35.0 19.00 17.00 32.00 49.00
## 23 17.0 40.50 32.50 20.00 41.00
## 24 54.0 32.00 47.00 20.00 13.00
## 25 24.0 20.00 39.00 20.00 12.00
## 26 28.0 10.00 32.50 23.00 41.00
## 27 14.0 19.00 32.00 42.00 41.00
## 28 12.0 1.00 24.00 14.00 9.00
## 29 32.0 21.00 21.00 44.00 28.00
## 30 35.0 10.00 21.00 19.00 34.50
## 31 28.0 21.00 29.00 38.00 25.00
## 32 5.0 16.00 5.00 17.00 16.00
## 33 22.0 26.00 28.00 24.00 30.00
## 34 51.0 45.50 65.00 47.00 33.00
## 35 2.0 8.00 1.00 14.00 4.00
## 36 16.0 11.00 11.00 3.00 16.00
## 37 22.0 19.00 36.00 26.00 24.00
## 38 65.0 45.50 22.00 54.00 33.00
## 39 21.0 3.00 39.00 21.00 14.00
## 40 28.0 44.00 40.00 39.00 25.00
## 41 40.0 45.00 14.50 22.00 27.00
## 42 11.0 3.00 17.00 3.00 17.00
## 43 18.0 48.00 14.00 14.00 41.00
## 44 28.0 44.00 21.00 20.00 20.00
## 45 9.0 8.00 1.00 14.00 2.00
## 46 35.0 2.00 36.00 22.00 36.00
## 47 31.0 40.00 29.00 45.00 22.00
## 48 43.0 1.00 29.00 30.00 19.00
## 49 28.0 45.00 32.50 14.00 21.00
## 50 30.0 49.00 24.00 41.00 35.00
## 51 28.0 19.00 36.00 14.00 20.00
## 52 35.0 19.00 32.50 14.00 23.50
## 53 54.0 62.00 22.00 47.00 39.00
## 54 26.0 26.00 14.50 22.00 20.00
## 55 35.0 56.00 19.00 28.00 24.00
## 56 24.0 64.00 51.00 33.00 38.00
## 57 51.0 47.00 22.00 38.00 40.00
## 58 51.0 49.00 24.00 51.00 49.00
## 59 40.0 0.42 14.50 27.00 27.00
## 60 10.0 0.75 8.00 9.00 0.75
## 61 25.0 36.00 25.00 27.00 42.00
## 62 20.0 21.00 55.00 20.00 22.00
## 63 23.0 48.00 30.00 58.00 28.00
## 64 3.0 17.00 5.00 3.00 16.00
## 65 10.0 38.00 3.00 4.00 0.75
## 66 23.0 36.50 22.00 35.00 31.00
## 67 20.0 2.00 35.00 14.00 17.00
## 68 24.0 4.00 24.00 20.00 4.00
## 69 54.0 47.00 65.00 62.00 62.00
## 70 17.0 45.00 40.00 28.00 28.00
## 71 26.0 26.00 41.00 21.00 23.00
## 72 24.0 26.00 14.50 29.00 25.00
## 73 14.5 29.00 14.00 17.00 18.00
## 74 54.0 40.00 14.50 29.00 5.00
## 75 40.0 32.00 14.50 15.00 22.00
## 76 35.0 17.00 54.00 38.00 54.00
## 77 35.0 45.00 35.00 17.00 20.00
## 78 20.0 21.00 55.00 18.00 14.00
## 79 2.0 35.00 35.00 24.00 38.00
## 80 17.0 42.00 40.00 50.00 28.00
## 81 35.0 42.00 19.00 35.00 20.00
## 82 56.0 46.00 22.00 24.00 23.00
## 83 17.0 21.00 40.00 28.00 17.00
## 84 35.0 45.00 35.00 17.00 28.00
## 85 20.0 21.00 34.00 50.00 24.00
## 86 35.0 42.00 55.00 24.00 28.00
## 87 24.0 14.00 4.00 25.00 1.00
## 88 40.0 30.00 31.00 29.00 27.00
## 89 31.0 22.00 14.00 20.00 24.00
## 90 20.0 32.00 35.00 50.00 24.00
## 91 27.0 17.00 54.00 48.00 54.00
## 92 28.0 45.00 55.00 18.00 24.00
## 93 14.0 32.00 40.00 18.00 7.00
## 94 36.0 22.00 22.00 24.00 19.00
## 95 21.0 42.00 19.00 7.00 24.00
## 96 14.0 32.00 19.00 50.00 50.00
## 97 51.0 38.00 54.00 24.00 31.00
## 98 56.0 46.00 36.00 24.00 54.00
## 99 2.0 31.00 1.00 9.00 0.75
## 100 41.0 22.00 0.67 30.00 54.00
## 101 17.0 24.00 35.00 0.83 7.00
## 102 14.0 28.00 40.00 18.00 24.00
## 103 35.0 32.00 34.00 7.00 50.00
## 104 51.0 18.00 24.00 51.00 19.00
## 105 20.0 32.00 19.00 24.00 1.00
## 106 50.0 28.00 35.00 44.00 23.00
## 107 35.0 28.00 55.00 50.00 50.00
## 108 22.0 28.00 55.00 33.00 50.00
## 109 27.0 29.00 22.00 51.00 54.00
## 110 14.0 32.00 40.00 18.00 50.00
## 111 32.0 29.00 54.00 39.00 16.00
## 112 17.0 28.00 34.00 0.83 23.00
## 113 14.0 42.00 18.00 28.00 42.00
## 114 35.0 18.00 40.00 7.00 1.00
## 115 24.0 62.00 22.00 46.00 22.00
## 116 20.0 8.00 35.00 7.00 1.00
## 117 14.0 8.00 55.00 18.00 7.00
## 118 14.0 1.00 55.00 50.00 23.00
## 119 22.0 8.00 34.00 24.00 7.00
## 120 48.0 29.00 26.00 26.00 26.00
## 121 14.5 17.00 29.00 24.00 24.00
## 122 14.0 8.00 19.00 24.00 1.00
## 123 14.0 7.00 55.00 33.00 23.00
## 124 41.0 8.00 17.00 19.00 19.00
## 125 36.0 35.00 25.00 41.00 34.00
## 126 24.0 1.00 55.00 50.00 23.00
## 127 33.0 33.00 40.00 50.00 8.00
## 128 22.0 47.00 51.00 46.00 28.00
## 129 28.0 24.00 35.00 50.00 8.00
## 130 24.0 19.00 9.00 15.00 1.00
## 131 29.0 7.00 19.00 7.00 28.00
## 132 20.0 24.00 55.00 24.00 23.00
## 133 37.0 40.00 22.00 38.00 60.00
## 134 18.0 18.00 0.67 14.50 40.00
## 135 26.0 29.00 31.00 24.00 28.00
## 136 59.0 33.00 34.00 7.00 32.00
## 137 38.0 33.00 35.00 50.00 18.00
## 138 48.0 39.00 41.00 0.42 18.00
## 139 28.0 1.00 55.00 1.00 28.00
## 140 33.0 33.00 40.00 7.00 42.00
## 141 64.0 29.00 63.00 27.00 35.00
## 142 23.0 16.00 19.00 21.00 56.00
## 143 17.0 50.00 55.00 7.00 45.00
## 144 15.0 16.00 26.00 24.00 37.00
## 145 24.0 18.00 33.00 40.00 32.00
## 146 15.0 20.00 9.00 4.00 3.00
## 147 51.0 33.00 22.00 62.00 71.00
## 148 14.0 33.00 35.00 42.00 21.00
## 149 40.0 48.00 31.00 26.00 18.00
## 150 16.0 2.00 22.00 33.00 35.00
## 151 36.0 24.00 34.00 7.00 45.00
## 152 19.0 50.00 55.00 23.00 45.00
## 153 48.0 18.00 27.00 28.00 44.00
## 154 21.0 18.00 34.00 21.00 44.00
## 155 37.0 39.00 65.00 54.00 46.00
## 156 31.0 40.00 29.00 22.00 15.00
## 157 25.0 17.00 34.00 1.00 44.00
## 158 45.5 28.00 35.00 18.00 21.00
## 159 45.5 35.00 35.00 18.00 21.00
## 160 27.0 39.00 5.00 48.00 39.00
## 161 24.0 28.00 53.00 1.00 42.00
## 162 16.0 3.00 17.00 3.00 16.00
## 163 24.0 39.00 65.00 61.00 37.00
## 164 24.0 38.00 22.00 39.00 31.00
## 165 27.0 20.00 34.00 8.00 45.00
## 166 32.0 50.00 55.00 36.00 19.00
## 167 54.0 20.00 31.00 40.00 4.00
## 168 42.0 20.00 34.00 8.00 21.00
## 169 22.0 17.00 19.00 32.00 19.00
## 170 19.0 50.00 25.00 44.00 30.00
## 171 11.0 5.00 5.00 16.00 16.00
## 172 48.0 21.00 38.00 26.00 31.00
## 173 22.0 35.00 53.00 23.00 21.00
## 174 16.0 5.00 3.00 3.00 17.00
## 175 22.0 20.00 55.00 2.00 44.00
## 176 22.0 14.00 55.00 28.00 21.00
## 177 15.0 26.00 5.00 48.00 26.00
write.csv(file="Age_IMP.csv", x=dfL2)
# Go to Python with this CSV to fill in the blanks for Age
# Tangential analysis: inspect the Titanic dataset that ships with R
dftt<-as.data.frame(Titanic)
summary(dftt)
## Class Sex Age Survived Freq
## 1st :8 Male :16 Child:16 No :16 Min. : 0.00
## 2nd :8 Female:16 Adult:16 Yes:16 1st Qu.: 0.75
## 3rd :8 Median : 13.50
## Crew:8 Mean : 68.78
## 3rd Qu.: 77.00
## Max. :670.00
write.csv(file="Titanic_IMP.csv", x=dftt)
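The summary above says little, because each row of dftt is a cell of a contingency table and Freq holds the passenger count for that cell. A minimal sketch of one way to turn those counts into a survival rate per class, using only base R. The names surv_by_class and surv_rate are illustrative and not part of the original script:

```r
# Built-in contingency table, flattened to one row per cell with its count in Freq
dftt <- as.data.frame(Titanic)

# Re-tabulate the counts by Class and Survived, summing over Sex and Age
surv_by_class <- xtabs(Freq ~ Class + Survived, data = dftt)

# Row-wise proportions: the share of each class that survived
surv_rate <- prop.table(surv_by_class, margin = 1)[, "Yes"]
print(round(surv_rate, 3))
```

Weighting by Freq matters here: averaging the rows of dftt directly would treat a cell with 0 people the same as a cell with 670.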
# Back to Megan Risdal's code
# Passengers 62 and 830 are missing Embarked (port of embarkation)
full[c(62, 830), 'Embarked']
## [1] "" ""
# Exclude the two passengers whose Embarked value is missing
embark_fare <- full %>%
  filter(PassengerId != 62 & PassengerId != 830)
# Use ggplot2 to visualize embarkment, passenger class, & median fare
ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept = 80),
             colour = 'red', linetype = 'dashed', lwd = 2) +
  scale_y_continuous(labels = dollar_format()) +
  theme_few()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
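The dashed line at $80 presumably marks the fare paid by the two passengers with no Embarked value, so the boxplot is being read for the port and class whose median fare sits closest to that line. A minimal sketch of the same comparison in numbers follows. It runs on a small made-up data frame (toy, with illustrative values that are not taken from the Kaggle files) and uses base aggregate() rather than dplyr so that it runs on its own. The names toy, med and gap are hypothetical:

```r
# Toy stand-in for embark_fare; the values are invented for illustration only
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q", "Q"),
  Pclass   = c(1, 1, 3, 1, 1, 3, 3, 3),
  Fare     = c(80, 76, 8, 52, 60, 8, 7.75, NA),
  stringsAsFactors = FALSE
)

# Median fare per port and class; the formula interface drops rows with NA Fare
med <- aggregate(Fare ~ Embarked + Pclass, data = toy, FUN = median)

# Distance of each group's median from the $80 reference line, closest first
med$gap <- abs(med$Fare - 80)
print(med[order(med$gap), ])
```

On the real data the same call would take embark_fare in place of toy. Whether the closest group is a sound choice for filling in Embarked is a judgment call that the plot alone does not settle.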
Work in progress...
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.