julian

Reputation: 367

R - Is there a way to restrict the range of values imputed by 'mi'? (Working with Kaggle Titanic data set)

I've been working through the How to perform a Logistic Regression in R tutorial on R-bloggers, which uses the data set from the Kaggle Titanic challenge. A gist with all of the code in the post can be found here.

The training data set contains missing values:

[Image: missing data in training data set]

Data for 891 passengers are included in this data set (891 rows) and 177 have missing Age values:

                             type missing method  model
PassengerId            continuous       0   <NA>   <NA>
Survived                   binary       0   <NA>   <NA>
Pclass        ordered-categorical       0   <NA>   <NA>
Name        unordered-categorical       0   <NA>   <NA>
Sex                        binary       0   <NA>   <NA>
Age                    continuous     177    ppd linear   <----
SibSp                  continuous       0   <NA>   <NA>
Parch                  continuous       0   <NA>   <NA>
Ticket      unordered-categorical       0   <NA>   <NA>
Fare                   continuous       0   <NA>   <NA>
Cabin       unordered-categorical     687    ppd mlogit
Embarked    unordered-categorical       2    ppd mlogit

In the tutorial, the missing values are simply replaced by the mean of the present Age values:

    data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)

I am interested in imputing the missing values instead of doing mean or median replacement. Several imputation packages exist, such as Amelia and mice, but I have used mi in the past, so I chose it for this problem.

The main issue is that the range of the imputed values when I used mi was not reasonable:

[Image: Age vs Imputed]

The red bar is the mean of each distribution. Passenger age ranges from 0.42 to 80 (years). The imputed values range from less than -100 to greater than 200.

[Image: Boxplot]

Obviously this is not useful at all. Below is the code that I used. I used the mi vignette as a guide.

    library(mi)

    training.data.raw <- read.csv("train.csv", header = TRUE, na.strings = c(""))
    # create missing data frame for use with mi
    training.data.raw.mdf <- missing_data.frame(training.data.raw)
    #image(training.data.raw.mdf)


    # adjust variable types
    training.data.raw.mdf <- change(training.data.raw.mdf, y = "Parch", what = "type", to = "ord")
    training.data.raw.mdf <- change(training.data.raw.mdf, y = "SibSp", what = "type", to = "count")
    training.data.raw.mdf <- change(training.data.raw.mdf, y = "PassengerId", what = "type", to = "irrelevant")

    # parallel imputation should be default on non-Windows systems (i.e. Linux)
    imputations <- mi(training.data.raw.mdf, n.iter = 30, n.chains = 4, max.minutes = 20)
    round(mipply(imputations, mean, to.matrix = TRUE), 3)

    # get data frames
    imputed.dataframes <- complete(imputations, m = 1)

Is there a way to control the range of imputed values so that they fall between, let's say, 0 and 80?

I will gladly use any imputation package (mi, mice, Amelia) as long as it produces reasonable results.
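One option that fits this request, given here as a sketch rather than a tested solution for the Titanic data: predictive mean matching in mice fills each missing value with an observed value borrowed from a donor row, so imputed values cannot fall outside the observed range. The example uses R's built-in `airquality` data so that it runs without `train.csv`, and the helper name `impute_pmm` is hypothetical.

```r
library(mice)

# Hypothetical helper: impute with predictive mean matching (pmm) and
# return the first completed data set.
impute_pmm <- function(df, seed = 1) {
  imp <- mice(df, m = 5, method = "pmm", seed = seed, printFlag = FALSE)
  # mice::complete is spelled out because mi also exports complete()
  mice::complete(imp, 1)
}

# airquality has missing values in Ozone and Solar.R
completed <- impute_pmm(airquality)

# Imputed values are copies of observed values, so the second range
# stays inside the first
range(airquality$Ozone, na.rm = TRUE)
range(completed$Ozone)
```

For the Titanic data, the same helper would presumably be given a subset of columns such as Survived, Pclass, Sex (as a factor), Age, SibSp, Parch and Fare, leaving out free-text columns like Name and Ticket. That column choice is an assumption, not something taken from the tutorial.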

Upvotes: 0

Views: 1504

Answers (1)

fzk

Reputation: 441

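Try the bounded-continuous variable type from the mi package (documented under bounded-continuous-class). It lets you set lower and upper bounds on the variable, which should do it for you.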

Here is the example from the documentation:

    # STEP 0: GET DATA
    data(CHAIN, package = "mi")

    # STEP 0.5 CREATE A missing_variable (you never need to actually do this)
    lo_bound <- 0
    hi_bound <- rep(Inf, nrow(CHAIN))
    hi_bound[CHAIN$log_virus == 0] <- 6

    log_virus <- missing_variable(ifelse(CHAIN$log_virus == 0, NA, CHAIN$log_virus),
                                  type = "bounded-continuous",
                                  lower = lo_bound, upper = hi_bound)

    show(log_virus)
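Applied to the Age column from the question, the same constructor call might look like the sketch below. The toy `age` vector is a stand-in for `training.data.raw$Age`, and the bounds 0 and 80 are the ones suggested in the question. How best to swap the resulting variable into the `missing_data.frame` is not shown, so treat this as a starting point rather than a complete recipe.

```r
library(mi)

# Toy stand-in for training.data.raw$Age (NA marks a missing age)
age <- c(22, 38, NA, 35, NA, 54, 2, 27, NA, 14)

# Same constructor as the documentation example: a scalar lower bound
# and one upper bound per observation
age_mv <- missing_variable(age,
                           type = "bounded-continuous",
                           lower = 0,
                           upper = rep(80, length(age)))

show(age_mv)
```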

Upvotes: 2
