user3310782
user3310782

Reputation: 831

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.

Say you have 2 DFs:

df1:mode   df2:sex
1           1
2           2
3

And a DF3 where most of the combinations are not present, e.g.

mode | sex  | cases      
1        1      9
1        1      2
2        2      7
3        1      2
1        2      5

and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):

  mode | sex  | cases      
    1        1     11
    1        2     5
    2        1     0
    2        2     7
    3        1     2
    3        2     0    

If you do a single left_join (left_join(df1,df3) you recover the modes not in df3, but 'Sex' appears as 'NA', and the same if you do left_join(df2,df3).

So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.

Thanks in advance, p.

Upvotes: 2

Views: 822

Answers (2)

aosmith
aosmith

Reputation: 36076

The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.

The help page says:

This is a wrapper around expand(), left_join() and replace_na that's useful for completing missing combinations of data. It turns implicitly missing values into explicitly missing values.

To add the missing combinations of df3 and fill with 0 values instead, you would do:

library(tidyr)
library(dplyr)

df3 %>% complete(mode, sex, fill = list(cases = 0))

  mode sex cases
1    1   1     9
2    1   1     2
3    1   2     5
4    2   1     0
5    2   2     7
6    3   1     2
7    3   2     0

You would still need to group_by and summarise to get the final output you want.

df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
    group_by(mode, sex) %>%
    summarise(cases = sum(cases))

Source: local data frame [6 x 3]
Groups: mode

  mode sex cases
1    1   1    11
2    1   2     5
3    2   1     0
4    2   2     7
5    3   1     2
6    3   2     0

Upvotes: 5

MrFlick
MrFlick

Reputation: 206187

First here's you data in a more friendly, reproducible format

df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))

I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then i left join that to the data and replace NA values with zero.

mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))

which gives

  mode sex cases
1    1   1    11
2    1   2     5
3    2   1     0
4    2   2     7
5    3   1     2
6    3   2     0

Upvotes: 3

Related Questions