jay.sf
jay.sf

Reputation: 72603

How to impute values from groups in a category with `sapply()`?

I want to impute missings in val of all ctry in cat1 with particular ctry means.

Data example

set.seed(654)
df1 <- data.frame(
  year=rep(2000:2005, each=5),
  ctry=rep(LETTERS[1:5], 6),
  val=rnorm(30)
)
df1$cat <- ifelse(df1$ctry %in% c("A", "B"), 1, 0)
df1[sample(nrow(df1), 12), "val"] <- NA
> head(df1)
  year ctry         val cat
1 2000    A -0.76031762   1
2 2000    B -0.38970450   1
3 2000    C  1.68962523   0
4 2000    D          NA   0
5 2000    E  0.09530146   0
6 2001    A          NA   1

First, I get the names of ctry in cat1 and allocate their means.

cat1 <- as.character(sort(unique(
  df1[!is.na(df1$val) & df1$cat == 1, ]
  [, 2])))
cat1 <- sapply(cat1, function(x) mean(df1$val[df1$ctry == x], na.rm=TRUE))
> cat1
        A         B 
0.4372003 0.4792314 

Now I succeed in manually imputing country by country:

df2 <- df1
df2$val[df2$ctry %in% names(cat1)[1] & is.na(df2$val)] <- cat1[1]
> head(df2)
  year ctry         val cat
1 2000    A -0.76031762   1
2 2000    B -0.38970450   1
3 2000    C  1.68962523   0
4 2000    D          NA   0
5 2000    E  0.09530146   0
6 2001    A -0.49758245   1

But for some reason I can't bring this sapply() to work, to do the imputation automatically:

> sapply(seq_along(cat1), 
+        function(x) df2$val[df2$ctry %in% names(cat1)[x] & is.na(df2$val)] <- cat1[x])
         A          B 
-0.4975825 -0.6139364 

The expected output would be a whole data frame with particular imputed means of the countries in category cat1.

Upvotes: 0

Views: 68

Answers (3)

dondapati
dondapati

Reputation: 849

if my understanding is correct you want automatic the last process

sapply(seq_along(cat1), 
 function(x) df2$val[df2$ctry %in% names(cat1)[x] & is.na(df2$val)] <<- cat1[x])

> df2
   year ctry          val cat
1  2000    A -0.760317618   1
2  2000    B -0.389704501   1
3  2000    C  1.689625228   0
4  2000    D           NA   0
5  2000    E  0.095301460   0
6  2001    A -0.497582454   1
7  2001    B -0.613936417   1
8  2001    C           NA   0
9  2001    D           NA   0
10 2001    E           NA   0
11 2002    A -0.107260116   1
12 2002    B -0.838168333   1
13 2002    C -0.982605890   0
14 2002    D -0.820370986   0
15 2002    E -0.871432562   0
16 2003    A -0.497582454   1
17 2003    B -0.613936417   1
18 2003    C -0.006557849   0
19 2003    D  0.661696295   0
20 2003    E -0.762828067   0
21 2004    A -0.286692466   1
22 2004    B -0.613936417   1
23 2004    C  0.512579937   0
24 2004    D  0.722127317   0
25 2004    E           NA   0
26 2005    A -0.836059616   1
27 2005    B -0.613936417   1
28 2005    C  0.774016151   0
29 2005    D -0.691866605   0
30 2005    E           NA   0

Here i just replace the < with <<- scoping assignment

Upvotes: 1

r.user.05apr
r.user.05apr

Reputation: 5456

In Base R:

set.seed(654)
df1 <- data.frame(
  year=rep(2000:2005, each=5),
  ctry=rep(LETTERS[1:5], 6),
  val=rnorm(30)
)
df1$cat <- ifelse(df1$ctry %in% c("A", "B"), 1, 0)
df1[sample(nrow(df1), 12), "val"] <- NA

# want:
my_means <- tapply(df1$val, df1$ctry, mean, na.rm = TRUE)
df1$val <- ifelse(is.na(df1$val), my_means[df1$ctry], df1$val)

Upvotes: 1

Roman
Roman

Reputation: 17648

You can try a tidyverse approach using group_by to get the mean per ctry. Then update the NA's using ifelse. Added a new column val2 to illustrate what is happening. You can write "val" to override the column.

library(tidyverse)
df1 %>% 
  group_by(ctry) %>% 
  mutate(Mean=mean(val, na.rm = T)) %>% 
  mutate(val2=ifelse(is.na(val) & cat == 1, Mean, val)) %>% 
  ungroup()
# A tibble: 30 x 6
    year ctry       val   cat    Mean     val2
   <int> <fct>    <dbl> <dbl>   <dbl>    <dbl>
 1  2000 A      -0.760      1 -0.498   -0.760 
 2  2000 B      -0.390      1 -0.614   -0.390 
 3  2000 C       1.69       0  0.397    1.69  
 4  2000 D      NA          0 -0.0321  NA     
 5  2000 E       0.0953     0 -0.513    0.0953
 6  2001 A      NA          1 -0.498   -0.498 
 7  2001 B      NA          1 -0.614   -0.614 
 8  2001 C      NA          0  0.397   NA     
 9  2001 D      NA          0 -0.0321  NA     
10  2001 E      NA          0 -0.513   NA     
# ... with 20 more rows

Upvotes: 1

Related Questions