Reputation: 1002
I need to perform multiple imputation on a data set similar to the following toy data set.
x1=rbinom(20,1,0.5)
x2=rnorm(20,100,2)
x2=x2/max(x2)
x3=rbinom(20,3,0.4)
x4=rnorm(20,0,0.5)
data=data.frame(x1,x2,x3,x4)
data[1,2]=NA
data[10,2]=NA
data[15,2]=NA
data[2,1]=NA
data[12,1]=NA
data[12,3]=NA
data[19,3]=NA
data[20,3]=NA
> data
   x1        x2 x3           x4
1   0        NA  2  0.103982689
2  NA 0.9599301  1 -0.153152527
3   1 0.9563003  0 -0.783492651
4   0 0.9974261  1  0.325931603
5   1 0.9515747  3 -0.769568378
6   1 0.9431853  0  0.336488307
7   0 0.9637072  1  0.383011575
8   0 0.9937089  0 -0.575941420
9   1 0.9357041  3 -0.648096345
10  0        NA  3  0.213382349
11  0 0.9454354  1  0.111094020
12 NA 0.9330617 NA -0.256448985
where x1 is a binary variable, x2 is a variable between 0 and 1, and x3 is an ordinal variable. I did the imputation using the amelia function from the Amelia package, but the imputed values are not within the desired range.
require(Amelia)
imp=amelia(data,m=2)
imp1=imp$imputations[[1]]
imp2=imp$imputations[[2]]
For example, I got values greater than 1 for x1 and x2. Also, the imputed values for x1 and x3 do not preserve the categorical nature of the data.
Is it possible to do the imputation for a data set that involves categorical variables (including binary and ordinal variables) and continuous variables within a certain range using R?
Upvotes: 0
Views: 489
Reputation: 11046
The Amelia package, named after a famous missing person, imputes values for missing data. The mice package ("Multivariate Imputation by Chained Equations") also imputes missing values and can handle categorical data, as long as those columns are stored as factors. First we need some reproducible data to share, created with dput(data). I've rounded the values to make it more compact and stored x1 and x3 as factors:
data <- structure(list(x1 = structure(c(2L, NA, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, NA, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), x2 = c(NA, 0.962, 0.951, 0.964, 0.968,
0.948, 0.96, 0.944, 1, NA, 0.956, 0.944, 0.975, 0.965, NA, 0.928,
0.975, 0.948, 0.968, 0.958), x3 = structure(c(2L, 2L, 3L, 3L,
1L, 2L, 2L, 1L, 1L, 1L, 3L, NA, 2L, 1L, 2L, 3L, 3L, 2L, NA, NA
), .Label = c("0", "1", "2"), class = "factor"), x4 = c(0.433,
0.29, -0.01, -0.092, -0.29, 0.2, 0.284, 0.206, 0.132, -0.024,
-0.188, 0.57, 0.092, -0.624, 0.241, -0.262, -0.621, -0.888, 0.346,
-0.043)), row.names = c(NA, -20L), class = "data.frame")
str(data)
# 'data.frame': 20 obs. of 4 variables:
# $ x1: Factor w/ 2 levels "0","1": 2 NA 1 2 2 2 2 1 1 1 ...
# $ x2: num NA 0.962 0.951 0.964 0.968 0.948 0.96 0.944 1 NA ...
# $ x3: Factor w/ 3 levels "0","1","2": 2 2 3 3 1 2 2 1 1 1 ...
# $ x4: num 0.433 0.29 -0.01 -0.092 -0.29 0.2 0.284 0.206 0.132 -0.024 ...
You can see the first and third variables are factors/categorical.
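If your columns are still numeric, as in the question's toy data, they would need to be converted before mice can treat them as categorical. A minimal sketch in base R, assuming hypothetical names (toy, the seed value) that are not from the original post. Making x3 an ordered factor is optional, but it matches the question's description of x3 as ordinal:

```r
set.seed(42)  # arbitrary seed, only so this sketch is repeatable

# Numeric columns generated the same way as in the question
toy <- data.frame(
  x1 = rbinom(20, 1, 0.5),
  x3 = rbinom(20, 3, 0.4)
)
toy[c(2, 12), "x1"] <- NA
toy[c(12, 19, 20), "x3"] <- NA

toy$x1 <- factor(toy$x1)                  # binary  -> unordered factor
toy$x3 <- factor(toy$x3, ordered = TRUE)  # ordinal -> ordered factor

str(toy)  # both columns should now be reported as factors; NAs are kept
```

The conversion leaves the missing values as NA, so the imputation step is unchanged.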
library(mice)
data.imp <- mice(data)
The object data.imp (of class mids) contains the information needed to produce 5 completed data sets (5 is the default for m) with the missing values imputed. To get the first set:
data.imp1 <- complete(data.imp, 1)
str(data.imp1)
# 'data.frame': 20 obs. of 4 variables:
# $ x1: Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 1 1 1 ...
# $ x2: num 0.965 0.962 0.951 0.964 0.968 0.948 0.96 0.944 1 0.962 ...
# $ x3: Factor w/ 3 levels "0","1","2": 2 2 3 3 1 2 2 1 1 1 ...
# $ x4: num 0.433 0.29 -0.01 -0.092 -0.29 0.2 0.284 0.206 0.132 -0.024 ...
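Note that the first and third variables are still factors. The imputed x2 values (0.965 in row 1, 0.962 in row 10) also stay within the observed range, because the default method mice uses for numeric columns, predictive mean matching, only draws from values that were actually observed.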
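To check this on your own data, you could inspect which method mice chose for each column and confirm that the imputed values respect the observed range. The following is a rough sketch rather than a definitive recipe: the object names (toy, obs_range, long) and both seed values are assumptions made for illustration, and x3 is stored as an ordered factor here so that mice treats it as ordinal.

```r
library(mice)

set.seed(1)  # arbitrary seed so the toy data is repeatable
x2 <- rnorm(20, 100, 2)
toy <- data.frame(
  x1 = factor(rbinom(20, 1, 0.5)),                  # binary
  x2 = x2 / max(x2),                                # continuous, at most 1
  x3 = factor(rbinom(20, 3, 0.4), ordered = TRUE),  # ordinal
  x4 = rnorm(20, 0, 0.5)
)
toy[c(1, 10, 15), "x2"] <- NA
toy[c(2, 12), "x1"] <- NA
toy[c(12, 19, 20), "x3"] <- NA

# Range of the observed (non-missing) x2 values
obs_range <- range(toy$x2, na.rm = TRUE)

# m = 5 imputations; seed makes the imputation repeatable
imp <- mice(toy, m = 5, seed = 123, printFlag = FALSE)

# Method chosen per column; an empty string means nothing was imputed
print(imp$method)

# Stack all five completed data sets and check x2 against the observed range
long <- complete(imp, action = "long")
stopifnot(all(long$x2 >= obs_range[1], long$x2 <= obs_range[2]))
```

With the default settings, mice should pick logistic regression for the two-level factor, predictive mean matching for the numeric column, and a proportional odds model for the ordered factor. If you need a different model for a column, the method argument of mice() accepts one method name per column.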
Upvotes: 1