user1987607
user1987607

Reputation: 2157

Compare and merge two dataframes

I have the following two dataframes in R:

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")

df1 
  X  Y  Z score
1 A  1  6     1
2 A 11 20     2
3 A 21 30     3
4 B 35 40     4
5 B 45 60     5

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")

df2
  X  Y  Z out
1 A  1  5  x1
2 A  6 20  x2
3 A 21 30  x3
4 A 50 60  x4 
5 B 20 30  x5
6 B 31 40  x6
7 B 50 60  x7
8 C 10 20  x8

For every row in df1, I want to check:

This is how the output should look like:

output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x4", "x5"))
colnames(output) = c("X", "Y", "Z", "score", "out")

  X  Y  Z score    out
1 A  1  6     1 x1, x2
2 A 11 20     2     x2
3 A 21 30     3     x3
4 B 35 40     4     x6
5 B 45 60     5     x7

The original df1 is kept with an extra column 'out' that is added.

Line 1 from 'output', contains 'x1, x2' in column 'out'. Why: there is a match between the values in column 'X' and range 1 to 6 overlap with lines 1 and 2 from df2.

I've asked this question before (Compare values from two dataframes and merge) where it is suggested to use the foverlaps function. However because of the different columns between df1 and df2 and the extra rows in df2, I cannot make it work.

Upvotes: 1

Views: 1125

Answers (3)

Arun
Arun

Reputation: 118779

Here are two possible ways, a) using the newly implemented non equi joins feature, and b) foverlaps as you'd specifically mentioned that..

a) non-equi joins

dt2[dt1, on=.(X, Z>=Y, Y<=Z), 
      .(score, out=paste(out, collapse=",")), 
    by=.EACHI]

where dt1 and dt2 are data.tables corresponding to df1 and df2. Note that you'll have to revert column names Z and Y in the result (since the column names come from dt2 but the values from dt1.

Matching rows from dt2 corresponding to each row is dt1 is found based on the condition provided to the on argument and .() is evaluated for each of those matching rows (because of by=.EACHI).

b) foverlaps

setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]

Upvotes: 2

AntoniosK
AntoniosK

Reputation: 16121

library(dplyr)

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), 
                 c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = F)
colnames(df1) = c("X", "Y", "Z", "score")

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), 
                 c(5, 20, 30, 60, 30, 40, 60, 20), 
                 c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = F)
colnames(df2) = c("X", "Y", "Z", "out")


df1 %>%
  left_join(df2, by="X") %>%         # join on main column
  rowwise() %>%                      # for each row
  mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>%   # get how many elements of those ranges overlap
  filter(counter > 0) %>%            # keep rows with overlap
  group_by(X, Y.x, Z.x, score) %>%   # for each combination of those columns
  summarise(out = paste(out, collapse=", ")) %>%                # combine out column
  ungroup() %>%
  rename(Y = Y.x,
         Z = Z.x)

# # A tibble: 5 × 5
#       X     Y     Z score    out
#    <chr> <dbl> <dbl> <dbl> <chr>
# 1     A     1     6     1 x1, x2
# 2     A    11    20     2     x2
# 3     A    21    30     3     x3
# 4     B    35    40     4     x6
# 5     B    45    60     5     x7

The above process is based on dplyr package and involves a join and some grouping and filtering. If your initial datasets (df1, df2) are extremely large then the join will create an even bigger dataset that will need some time to be created.

Also, note that this process works with character and not factor variables. The process might convert factor variables to character if it tries to join factor variables with different levels.

I'd suggest you run the chained commands step by step to see how it works and spot if I missed anything that might lead to bugs in the code.

Upvotes: 1

Chirayu Chamoli
Chirayu Chamoli

Reputation: 2076

Here is another options using sqldf

library(sqldf)
xx=sqldf('select t1.*,t2.out from df1 t1 left join df2 t2 on t1.X=t2.X and ((t2.Y between t1.Y and t1.Z) or (t2.Z between t1.Y and t1.Z))')
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", "))

Upvotes: 0

Related Questions