Dplyr join: NA match to any

Question

Edits I'm editing this post a little bit to provide a bit more context in case the whole approach was wrong from the start. See "Context" below for trying to explain the problem more abstractly.

I have seen the thread where the matching of NAs in tibbles is discussed, and the options are to match them to other NAs, or not to match them to anything: dplyr left_join matching NA

However, I am really looking for the opposite behaviour. Is there a way of having NAs (or whichever missing value for that case) matched to any other value during a join operation? An example below:

library(tidyverse)
# Removed output for brevity

tbl1 <- tibble(subj = 1, run = 1, session=1)

tbl2 <- tibble(subj = c(1, NA, 2), run = c(NA, 1, 2), session=c(NA, NA, 1), outcomedata = c(NA, NA, NA) )

tbl2$outcomedata[2][[1]] <- list(temperature=30)
tbl2$outcomedata[1][[1]] <- list(height=155, weight=80)
tbl2$outcomedata[3][[1]] <- list(temperature=20)

tbl1
#> # A tibble: 1 x 3
#>    subj   run session
#>       
#> 1  1.00  1.00    1.00
tbl2
#> # A tibble: 3 x 4
#>    subj   run session outcomedata
#>             
#> 1  1.00 NA      NA     
#> 2 NA     1.00   NA     
#> 3  2.00  2.00    1.00 

left_join(tbl1, tbl2)
#> Joining, by = c("subj", "run", "session")
#> # A tibble: 1 x 4
#>    subj   run session outcomedata
#>             
#> 1  1.00  1.00    1.00

My desired end result is that I can match the first and the second row of tbl2 to the single row of tbl1, since these rows match on all non-NA attributes. The third row should not match to anything, since it differs on non-NA values. Thus, I am trying to get the final output to be as follows:

#> # A tibble: 2 x 4
#>    subj  run   session  outcomedata
#>                 
#> 1  1.00  1.00    1.00      
#> 2  1.00  1.00    1.00

Context

Let me provide context in case I am way out here and barking up the wrong tree with the joins and there's an easier alternative. I have a bunch of nested json files (which I instantiate in R as lists), which contain various information that I want to attribute to specific instances in the data. One json might contain information which pertains to all instances in the data for subject 1 (i.e. the first row of tbl2), while another pertains to all instances in the data for run 1 (i.e. the second row of tbl2).

I would like to be able to merge all relevant information for each constellation of parameters in the data (one of which is in tbl1, but the plan is to have them all) in separate lists. My plan has been to try to get everything to match to everything related, and then to use a group_by operation over all parameters (i.e. group_by(subj, run, session)) and merge the lists (my plan was to use rlist::list.merge).

Any help would be massively appreciated!

moodymudskipper · Accepted Answer

Here's a tidyverse solution :

tbl2 %>%
  split(seq(nrow(.))) %>%               # split into one row data frames
  map_dfr(~modify_if(.,is.na,~NULL) %>% # remove na columns
        inner_join(tbl1,.))             # inner join to table1

# # A tibble: 2 x 4
#    subj   run session outcomedata
#                 
# 1     1     1       1  
# 2     1     1       1

I use inner_join(tbl1,.) instead of inner_join(tbl1) to preserve column order.

And a base R translation :

df_list <- split(tbl2,seq(nrow(tbl2)))
df_list <- lapply(df_list,function(dfi){
  merge(tbl1, dfi[!sapply(dfi,is.na)])
})
do.call(rbind,df_list)
#   subj run session outcomedata
# 1    1   1       1     155, 80
# 2    1   1       1          30

Bonus

2 100% tidyverse approaches using group_by instead of split. one with do, one with nest and map. do is being soft deprecated FYI but here it offers more compact and readable syntax:

tbl2 %>%
  group_by(n=seq(n())) %>%
  do(modify_if(.,is.na,~NULL) %>% # remove na columns
            inner_join(tbl1,.)) %>%
  ungroup %>%
  select(-n)

tbl2 %>%
  rowid_to_column("n") %>%
  group_by(n) %>%
  nest(.key="dfi") %>%
  mutate_at("dfi",~map(.,
                       ~ modify_if(.,is.na,~NULL) %>% # remove na columns
                         inner_join(tbl1,.))) %>%
  unnest %>%
  select(-n)

Dplyr join: NA match to any

Answers (1)

Related Questions