Reputation: 80
Edits I'm editing this post a little bit to provide a bit more context in case the whole approach was wrong from the start. See "Context" below for trying to explain the problem more abstractly.
I have seen the thread where the matching of NAs in tibbles is discussed, and the options are to match them to other NAs, or not to match them to anything: dplyr left_join matching NA
However, I am really looking for the opposite behaviour. Is there a way of having NAs (or whichever missing value for that case) matched to any other value during a join operation? An example below:
library(tidyverse)
# Removed output for brevity
tbl1 <- tibble(subj = 1, run = 1, session=1)
tbl2 <- tibble(subj = c(1, NA, 2), run = c(NA, 1, 2), session=c(NA, NA, 1), outcomedata = c(NA, NA, NA) )
tbl2$outcomedata[2][[1]] <- list(temperature=30)
tbl2$outcomedata[1][[1]] <- list(height=155, weight=80)
tbl2$outcomedata[3][[1]] <- list(temperature=20)
tbl1
#> # A tibble: 1 x 3
#> subj run session
#> <dbl> <dbl> <dbl>
#> 1 1.00 1.00 1.00
tbl2
#> # A tibble: 3 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 NA NA <list [2]>
#> 2 NA 1.00 NA <list [1]>
#> 3 2.00 2.00 1.00 <list [1]>
left_join(tbl1, tbl2)
#> Joining, by = c("subj", "run", "session")
#> # A tibble: 1 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 1.00 1.00 <NULL>
My desired end result is that I can match the first and the second row of tbl2 to the single row of tbl1, since these rows match on all non-NA attributes. The third row should not match to anything, since it differs on non-NA values. Thus, I am trying to get the final output to be as follows:
#> # A tibble: 2 x 4
#> subj run session outcomedata
#> <dbl> <dbl> <dbl> <list>
#> 1 1.00 1.00 1.00 <list [2]>
#> 2 1.00 1.00 1.00 <list [1]>
Context
Let me provide context in case I am way out here and barking up the wrong tree with the joins and there's an easier alternative. I have a bunch of nested json files (which I instantiate in R as lists), which contain various information that I want to attribute to specific instances in the data. One json might contain information which pertains to all instances in the data for subject 1 (i.e. the first row of tbl2), while another pertains to all instances in the data for run 1 (i.e. the second row of tbl2).
I would like to be able to merge all relevant information for each constellation of parameters in the data (one of which is in tbl1, but the plan is to have them all) in separate lists. My plan has been to try to get everything to match to everything related, and then to use a group_by operation over all parameters (i.e. group_by(subj, run, session)) and merge the lists (my plan was to use rlist::list.merge).
Any help would be massively appreciated!
Upvotes: 3
Views: 1079
Reputation: 47320
Here's a tidyverse
solution :
tbl2 %>%
split(seq(nrow(.))) %>% # split into one row data frames
map_dfr(~modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.)) # inner join to table1
# # A tibble: 2 x 4
# subj run session outcomedata
# <dbl> <dbl> <dbl> <list>
# 1 1 1 1 <list [2]>
# 2 1 1 1 <list [1]>
I use inner_join(tbl1,.)
instead of inner_join(tbl1)
to preserve column order.
And a base R
translation :
df_list <- split(tbl2,seq(nrow(tbl2)))
df_list <- lapply(df_list,function(dfi){
merge(tbl1, dfi[!sapply(dfi,is.na)])
})
do.call(rbind,df_list)
# subj run session outcomedata
# 1 1 1 1 155, 80
# 2 1 1 1 30
Bonus
2 100% tidyverse approaches using group_by
instead of split
. one with do
, one with nest
and map
. do
is being soft deprecated FYI but here it offers more compact and readable syntax:
tbl2 %>%
group_by(n=seq(n())) %>%
do(modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.)) %>%
ungroup %>%
select(-n)
tbl2 %>%
rowid_to_column("n") %>%
group_by(n) %>%
nest(.key="dfi") %>%
mutate_at("dfi",~map(.,
~ modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.))) %>%
unnest %>%
select(-n)
Upvotes: 3