Reputation: 7526

Joins in R with multiple OR statements

I have two dataframes - test and idx - my goal is to use merge() or a similar function to make a conditional join.

For example, test has ids with multiple possible keys (and NA values in some). You wont find the same keys for two separate ids, the keys will always be unique.

> test
   id keyA keyB keyC
1 foo   NA    2   10
2 bar    1   NA    6
3 baz    7   NA    4
4  li    8    3   NA
5 qux    9   NA   NA

My goal is to join with idx where there is any match in keys, for example:

> idx
  key value
1   2    NA
2  10    NA
3   7    NA
4   4    NA
5   9    NA

The join should output:

> idx
  key value
1   2   foo
2  10   foo
3   7   bar
4   4   bar
5   9   quz

I understand how to use merge on one or multiple columns, but am not sure how to use it when there is an OR statement involved ( in this case, the match would be on keyA OR keyB OR keyC)

How would one perform this join in R?

the data:

dput(test)
structure(list(id = c("foo", "bar", "baz", "li", "qux"), keyA = c(NA, 
1, 7, 8, 9), keyB = c(2, NA, NA, 3, NA), keyC = c(10, 6, 4, NA, 
NA)), row.names = c(NA, -5L), class = "data.frame")

dput(idx)
structure(list(key = c(2, 10, 7, 4, 9), value = c(NA, NA, NA, 
NA, NA)), row.names = c(NA, -5L), class = "data.frame")

Upvotes: 5

Answers (4)

G.Fernandes

Reputation: 321

I don't know if this is the solution you are looking for, but you can do it by melting the test data.frame using the melt function from the reshape2 package.

test <- structure(list(id = c("foo", "bar", "baz", "li", "qux"), keyA = c(NA, 
1, 7, 8, 9), keyB = c(2, NA, NA, 3, NA), keyC = c(10, 6, 4, NA, 
NA)), row.names = c(NA, -5L), class = "data.frame")

library(reshape2)
melted_test <- melt(test)

melted_test
Using id as id variables
    id variable value
1  foo     keyA    NA
2  bar     keyA     1
3  baz     keyA     7
4   li     keyA     8
5  qux     keyA     9
6  foo     keyB     2
7  bar     keyB    NA
8  baz     keyB    NA
9   li     keyB     3
10 qux     keyB    NA
11 foo     keyC    10
12 bar     keyC     6
13 baz     keyC     4
14  li     keyC    NA
15 qux     keyC    NA

Then you would only have to merge idx and melted_test once and keep only the columns you want.

Upvotes: 2

utubun

Reputation: 4505

Another way to do that:

library(tidyverse)

nest(test, -id) %>%
  rowwise() %>%
  mutate(key = list(idx$key[idx$key %in% data])) %>%
  unnest(key, .drop = T) %>%
  select(key, value = 'id')

# # A tibble: 5 x 2
# key value
#   <int> <chr>
# 1     2 foo  
# 2    10 foo  
# 3     7 baz  
# 4     4 baz  
# 5     9 qux

Upvotes: 1

Sven

Reputation: 1263

A dplyr solution could be to first left_join and then coalesce:

library(dplyr)

test <- data.frame(id = c("foo", "bar", "baz", "li", "qux"), keyA = c(NA, 1, 7, 8, 9), keyB = c(2, NA, NA, 3, NA), keyC = c(10, 6, 4, NA, NA))

idx <- data.frame(key = c(2, 10, 7, 4, 9), value = c(NA, NA, NA, NA, NA))

idx <- left_join(idx, test[,c("keyA", "id")], by = c("key" = "keyA")) %>%
  left_join(test[,c("keyB", "id")], by = c("key" = "keyB")) %>%
  left_join(test[,c("keyC", "id")], by = c("key" = "keyC")) %>%
  mutate(value = coalesce(id.x,id.y,id)) %>%
  select(key, value)

Upvotes: 2

Ronak Shah

Reputation: 389235

We can gather test to long form and then left_join

library(dplyr)
library(tidyr)

idx %>%
  left_join(test %>%
             gather(key, value, -id, na.rm = TRUE), by = c("key" = "value")) %>%
   select(key, id)

#  key  id
#1   2 foo
#2  10 foo
#3   7 baz
#4   4 baz
#5   9 qux

Or as @David Arenburg mentions using data.table that would be

library(data.table)
melt(setDT(test), "id")[setDT(idx), on = .(value = key), .(key, id)]

Upvotes: 2

Joins in R with multiple OR statements

Answers (4)

Related Questions