Vikash Balasubramanian
Vikash Balasubramanian

Reputation: 3233

Summarizing by group of two variables

Consider a simplified dataset (the real one has more columns and rows):

df
    tp tf   weight
1  FWD RF 78.86166
2   MF LF 81.04566
3  DEF LF 80.70527
4  DEF LF 82.96071
5  DEF RF 78.42544
6   GK LF 79.37686
7  DEF RF 78.79928
8   MF RF       NA
9   MF RF 78.93815
10 DEF RF 80.00284

I want to fill the missing values in weight by the grouped median of tp and tf combined

What i have tried till now is the following (I have used dlpyr)

temp <- df %>% group_by(tp,tf) %>% summarise(mvalue = median(weight,na. rm = TRUE))

This gives temp to be :

temp
Source: local data frame [6 x 3]
Groups: tp [?]

      tp     tf   mvalue
  <fctr> <fctr>    <dbl>
1    DEF     LF 81.83299
2    DEF     RF 78.79928
3    FWD     RF 78.86166
4     GK     LF 79.37686
5     MF     LF 81.04566
6     MF     RF 78.93815

Now i am unable to figure out how to fill the missing values in df with the corresponding group median.

In my simple case there is only one NA corresponding to tp = MF and tf = RF, the median value if you look up at temp is 78.93815

How do i do this in general? Do suggest if you have a better approach than my initial one.

EDIT: The actual dataframe has a unique Id variable if that makes any difference or can help.

Upvotes: 3

Views: 2626

Answers (2)

akrun
akrun

Reputation: 886938

We can use the na.aggregate from zoo and change the FUN argument to median. By default it would be mean

library(zoo)
library(dplyr)
df %>% 
   group_by(tp, tf) %>%
    mutate(weight = na.aggregate(weight, FUN = median))
#     tp    tf   weight
#   <chr> <chr>    <dbl>
#1    FWD    RF 78.86166
#2     MF    LF 81.04566
#3    DEF    LF 80.70527
#4    DEF    LF 82.96071
#5    DEF    RF 78.42544
#6     GK    LF 79.37686
#7    DEF    RF 78.79928
#8     MF    RF 78.93815
#9     MF    RF 78.93815
#10   DEF    RF 80.00284

The corresponding syntax for data.table would be

library(data.table)
setDT(df)[, weight := na.aggregate(weight, FUN = median), by = .(tp, tf)]

Upvotes: 4

Sotos
Sotos

Reputation: 51582

You can try,

library(dplyr)
df %>% 
   group_by(tp, tf) %>% 
   mutate(weight = replace(weight, is.na(weight), median(weight, na.rm = TRUE)))

Upvotes: 5

Related Questions