Reputation: 59
I have a big dataset with 20% of missing strings.
NAME | AREA
--------------------------
Andy | Sales
Andy | NA
Andy | Sales
Andy | Sales
Andy | NA
Andy | Sales
Sandy | Construction
Sandy | Construction
Sandy | NA
Sandy | Construction
Sandy | Construction
Wendy | Planting
Wendy | Driving
Wendy | NA
Wendy | NA
Wendy | NA
In most cases of my data it is almost obvious, that Andy does Sales and Sandy is in Construction. But we cannot make sure about Wendy.
My desirable result is:
NAME | AREA
--------------------------
Andy | Sales
Andy | Sales
Andy | Sales
Andy | Sales
Andy | Sales
Andy | Sales
Sandy | Construction
Sandy | Construction
Sandy | Construction
Sandy | Construction
Sandy | Construction
Wendy | Planting
Wendy | Driving
Wendy | NA
Wendy | NA
Wendy | NA
Which is the best imputation package to handle it? Or, maybe, you have a better solution?
Thanks in advance!
Upvotes: 0
Views: 432
Reputation: 886938
Here is one option with data.table
library(data.table)
setDT(df)[, AREA := if(uniqueN(AREA, na.rm = TRUE) == 1)
first(AREA[!is.na(AREA)]) else AREA, NAME]
df
# NAME AREA
# 1: Andy Sales
# 2: Andy Sales
# 3: Andy Sales
# 4: Andy Sales
# 5: Andy Sales
# 6: Andy Sales
# 7: Sandy Construction
# 8: Sandy Construction
# 9: Sandy Construction
#10: Sandy Construction
#11: Sandy Construction
#12: Wendy Planting
#13: Wendy Driving
#14: Wendy <NA>
#15: Wendy <NA>
#16: Wendy <NA>
df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy",
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L,
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label =
c("Construction", "Driving", "Planting", "Sales"),
class = "factor")), class = "data.frame", row.names = c(NA, -16L))
Upvotes: 0
Reputation: 148
You can use the mice
package. It's very customizable but a simple implementation would be:
library(mice)
dt <- mutate(dt, AREA = as.factor(AREA)) #make sure that area is a categorical variable
imputed_dt <- mice(dt) %>% complete()
In this basic example mice will try to impute values for Wendy. But you should dig into the documentation.
Upvotes: 3
Reputation: 388797
Maybe you can try conditional fill based on distinct value in each group
library(dplyr)
df %>%
group_by(NAME) %>%
mutate(AREA = if(n_distinct(AREA, na.rm = TRUE) == 1) first(AREA) else AREA)
# NAME AREA
# <fct> <fct>
# 1 Andy Sales
# 2 Andy Sales
# 3 Andy Sales
# 4 Andy Sales
# 5 Andy Sales
# 6 Andy Sales
# 7 Sandy Construction
# 8 Sandy Construction
# 9 Sandy Construction
#10 Sandy Construction
#11 Sandy Construction
#12 Wendy Planting
#13 Wendy Driving
#14 Wendy NA
#15 Wendy NA
#16 Wendy NA
data
df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy",
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L,
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label =
c("Construction", "Driving", "Planting", "Sales"),
class = "factor")), class = "data.frame", row.names = c(NA, -16L))
Upvotes: 3