Reputation: 367
I've been working through the How to perform a Logistic Regression in R tutorial on R-bloggers, which uses the data set from the Kaggle Titanic challenge. A gist with all of the code in the post can be found here.
The training data set contains missing data. It includes 891 passengers (891 rows), and 177 of them have missing Age values:
            type                  missing method model
PassengerId continuous                  0 <NA>   <NA>
Survived    binary                      0 <NA>   <NA>
Pclass      ordered-categorical         0 <NA>   <NA>
Name        unordered-categorical       0 <NA>   <NA>
Sex         binary                      0 <NA>   <NA>
Age         continuous                177 ppd    linear <----
SibSp       continuous                  0 <NA>   <NA>
Parch       continuous                  0 <NA>   <NA>
Ticket      unordered-categorical       0 <NA>   <NA>
Fare        continuous                  0 <NA>   <NA>
Cabin       unordered-categorical     687 ppd    mlogit
Embarked    unordered-categorical       2 ppd    mlogit
In the tutorial, the missing values are simply replaced by the mean of the observed Age values:
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
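To see why mean replacement can be unsatisfying, here is a minimal base R sketch on a made-up age vector (the values are an assumption for illustration, not taken from the Titanic data). The mean is preserved, but the spread shrinks because every filled-in point sits exactly at the mean.

```r
# Toy ages with missing entries (made-up values, not the Titanic data)
age <- c(22, 38, NA, 35, NA, 54, 2, NA, 27, 14)

# Same replacement as in the tutorial, applied to a copy
age_filled <- age
age_filled[is.na(age_filled)] <- mean(age, na.rm = TRUE)

# The mean is unchanged by construction
mean(age, na.rm = TRUE)
mean(age_filled)

# The standard deviation drops, since the filled-in values add no variation
sd(age, na.rm = TRUE)
sd(age_filled)
```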
I am interested in imputing the missing values instead of using mean or median replacement. Several imputation libraries exist, such as amelia and MICE, but I have used mi in the past, which is why I chose it for this problem.
The main issue is that the range of the values imputed by mi was not reasonable:
The red bar is the mean of each distribution. Passenger age ranges from 0.42 to 80 (years). The imputed values range from less than -100 to greater than 200.
Obviously this is not useful at all. Below is the code that I used, with the mi vignette as a guide.
library(mi)
training.data.raw <- read.csv("train.csv", header = TRUE, na.strings = c(""))
# create missing data frame for use with mi
training.data.raw.mdf <- missing_data.frame(training.data.raw)
#image(training.data.raw.mdf)
# adjust variable types
training.data.raw.mdf <- change(training.data.raw.mdf, y = "Parch", what = "type", to = "ord")
training.data.raw.mdf <- change(training.data.raw.mdf, y = "SibSp", what = "type", to = "count")
training.data.raw.mdf <- change(training.data.raw.mdf, y = "PassengerId", what = "type", to = "irrelevant")
# parallel imputation should be default on non-Windows systems (i.e. Linux)
imputations <- mi(training.data.raw.mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
round(mipply(imputations, mean, to.matrix = TRUE), 3)
# get data frames
imputed.dataframes <- complete(imputations, m = 1)
Is there a way to control the range of the imputed values so that they fall between, let's say, 0 and 80?
I will gladly use any imputation library or method (mi, MICE, amelia, or something else) as long as it produces reasonable results.
Upvotes: 0
Views: 1504
Reputation: 441
Try the bounded-continuous class from the mi package. That should do it for you.
Here is the example from the documentation:
# STEP 0: GET DATA
data(CHAIN, package = "mi")
# STEP 0.5 CREATE A missing_variable (you never need to actually do this)
lo_bound <- 0
hi_bound <- rep(Inf, nrow(CHAIN))
hi_bound[CHAIN$log_virus == 0] <- 6
log_virus <- missing_variable(ifelse(CHAIN$log_virus == 0, NA, CHAIN$log_virus),
                              type = "bounded-continuous",
                              lower = lo_bound, upper = hi_bound)
show(log_virus)
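Since the question is open to other libraries, another option worth considering is predictive mean matching in mice. The sketch below is only an illustration under some assumptions: it uses the built-in airquality data set as a stand-in for the Titanic data, and the seed and number of imputations are arbitrary. With method = "pmm", each imputed value is copied from an observed donor, so imputed values cannot fall outside the observed range.

```r
library(mice)

# airquality ships with R and has missing values in Ozone and Solar.R;
# it stands in for the Titanic training data here (an assumption for illustration).
dat <- airquality

# Predictive mean matching: imputed values are taken from observed donors.
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Use the namespace explicitly, because mi also exports a complete() function.
completed <- mice::complete(imp, 1)

# Compare the observed range with the range after imputation
range(dat$Ozone, na.rm = TRUE)
range(completed$Ozone)
```

For the Titanic data the same call would be applied to a data frame containing Age and the predictors of interest. High-cardinality columns such as Name and Ticket would probably need to be dropped first.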
Upvotes: 2