Nihat
Nihat

Reputation: 59

Imputation for missing strings in R

I have a big dataset with 20% of missing strings.

NAME      |  AREA
--------------------------
Andy      |  Sales
Andy      |  NA
Andy      |  Sales
Andy      |  Sales
Andy      |  NA
Andy      |  Sales
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  NA
Sandy     |  Construction
Sandy     |  Construction
Wendy     |  Planting
Wendy     |  Driving
Wendy     |  NA
Wendy     |  NA
Wendy     |  NA

In most cases of my data it is almost obvious, that Andy does Sales and Sandy is in Construction. But we cannot make sure about Wendy.

My desirable result is:

NAME      |  AREA
--------------------------
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Wendy     |  Planting
Wendy     |  Driving
Wendy     |  NA
Wendy     |  NA
Wendy     |  NA

Which is the best imputation package to handle it? Or, maybe, you have a better solution?

Thanks in advance!

Upvotes: 0

Views: 432

Answers (3)

akrun
akrun

Reputation: 886938

Here is one option with data.table

library(data.table)
setDT(df)[,  AREA := if(uniqueN(AREA, na.rm = TRUE) == 1) 
              first(AREA[!is.na(AREA)]) else AREA, NAME]
df
#     NAME         AREA
# 1:  Andy        Sales
# 2:  Andy        Sales
# 3:  Andy        Sales
# 4:  Andy        Sales
# 5:  Andy        Sales
# 6:  Andy        Sales
# 7: Sandy Construction
# 8: Sandy Construction
# 9: Sandy Construction
#10: Sandy Construction
#11: Sandy Construction
#12: Wendy     Planting
#13: Wendy      Driving
#14: Wendy         <NA>
#15: Wendy         <NA>
#16: Wendy         <NA>

data

df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy", 
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L, 
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label = 
c("Construction", "Driving", "Planting", "Sales"), 
class = "factor")), class = "data.frame", row.names = c(NA, -16L))    

Upvotes: 0

Samsa
Samsa

Reputation: 148

You can use the mice package. It's very customizable but a simple implementation would be:

library(mice)
dt <- mutate(dt, AREA = as.factor(AREA)) #make sure that area is a categorical variable

imputed_dt <- mice(dt) %>% complete()

In this basic example mice will try to impute values for Wendy. But you should dig into the documentation.

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388797

Maybe you can try conditional fill based on distinct value in each group

library(dplyr)

df %>%
  group_by(NAME) %>%
  mutate(AREA = if(n_distinct(AREA, na.rm = TRUE) == 1) first(AREA) else AREA)


#   NAME  AREA        
#   <fct> <fct>       
# 1 Andy  Sales       
# 2 Andy  Sales       
# 3 Andy  Sales       
# 4 Andy  Sales       
# 5 Andy  Sales       
# 6 Andy  Sales       
# 7 Sandy Construction
# 8 Sandy Construction
# 9 Sandy Construction
#10 Sandy Construction
#11 Sandy Construction
#12 Wendy Planting    
#13 Wendy Driving     
#14 Wendy NA          
#15 Wendy NA          
#16 Wendy NA      

data

df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy", 
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L, 
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label = 
c("Construction", "Driving", "Planting", "Sales"), 
class = "factor")), class = "data.frame", row.names = c(NA, -16L))    

Upvotes: 3

Related Questions