Reputation: 2157
I have the following two dataframes in R:
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5))
colnames(df1) = c("X", "Y", "Z", "score")
df1
X Y Z score
1 A 1 6 1
2 A 11 20 2
3 A 21 30 3
4 B 35 40 4
5 B 45 60 5
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))
colnames(df2) = c("X", "Y", "Z", "out")
df2
X Y Z out
1 A 1 5 x1
2 A 6 20 x2
3 A 21 30 x3
4 A 50 60 x4
5 B 20 30 x5
6 B 31 40 x6
7 B 50 60 x7
8 C 10 20 x8
For every row in df1, I want to check:
This is how the output should look like:
output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x4", "x5"))
colnames(output) = c("X", "Y", "Z", "score", "out")
X Y Z score out
1 A 1 6 1 x1, x2
2 A 11 20 2 x2
3 A 21 30 3 x3
4 B 35 40 4 x6
5 B 45 60 5 x7
The original df1 is kept with an extra column 'out' that is added.
Line 1 from 'output', contains 'x1, x2' in column 'out'. Why: there is a match between the values in column 'X' and range 1 to 6 overlap with lines 1 and 2 from df2.
I've asked this question before (Compare values from two dataframes and merge) where it is suggested to use the foverlaps
function. However because of the different columns between df1 and df2 and the extra rows in df2, I cannot make it work.
Upvotes: 1
Views: 1125
Reputation: 118779
Here are two possible ways, a) using the newly implemented non equi joins feature, and b) foverlaps
as you'd specifically mentioned that..
a) non-equi joins
dt2[dt1, on=.(X, Z>=Y, Y<=Z),
.(score, out=paste(out, collapse=",")),
by=.EACHI]
where dt1
and dt2
are data.tables corresponding to df1
and df2
. Note that you'll have to revert column names Z
and Y
in the result (since the column names come from dt2
but the values from dt1
.
Matching rows from dt2
corresponding to each row is dt1
is found based on the condition provided to the on
argument and .()
is evaluated for each of those matching rows (because of by=.EACHI
).
b) foverlaps
setkey(dt1, X, Y, Z)
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L)
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]
Upvotes: 2
Reputation: 16121
library(dplyr)
df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45),
c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = F)
colnames(df1) = c("X", "Y", "Z", "score")
df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10),
c(5, 20, 30, 60, 30, 40, 60, 20),
c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = F)
colnames(df2) = c("X", "Y", "Z", "out")
df1 %>%
left_join(df2, by="X") %>% # join on main column
rowwise() %>% # for each row
mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>% # get how many elements of those ranges overlap
filter(counter > 0) %>% # keep rows with overlap
group_by(X, Y.x, Z.x, score) %>% # for each combination of those columns
summarise(out = paste(out, collapse=", ")) %>% # combine out column
ungroup() %>%
rename(Y = Y.x,
Z = Z.x)
# # A tibble: 5 × 5
# X Y Z score out
# <chr> <dbl> <dbl> <dbl> <chr>
# 1 A 1 6 1 x1, x2
# 2 A 11 20 2 x2
# 3 A 21 30 3 x3
# 4 B 35 40 4 x6
# 5 B 45 60 5 x7
The above process is based on dplyr
package and involves a join
and some grouping and filtering. If your initial datasets (df1
, df2
) are extremely large then the join
will create an even bigger dataset that will need some time to be created.
Also, note that this process works with character
and not factor
variables. The process might convert factor
variables to character
if it tries to join factor
variables with different levels.
I'd suggest you run the chained commands step by step to see how it works and spot if I missed anything that might lead to bugs in the code.
Upvotes: 1
Reputation: 2076
Here is another options using sqldf
library(sqldf)
xx=sqldf('select t1.*,t2.out from df1 t1 left join df2 t2 on t1.X=t2.X and ((t2.Y between t1.Y and t1.Z) or (t2.Z between t1.Y and t1.Z))')
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", "))
Upvotes: 0