Reputation: 23
I have two data frames:
dfA
" ID from to Lith
1 BG1 0 0.5 SED
2 BG1 0.5 0.6 GDI
3 BG1 0.6 2.8 GRN
3 ZH4 0 0.7 GRN
4 ZH4 0.7 3.0 GDI
dfB
" ID from to Weath
1 BG1 0 0.8 HW
2 BG1 0.8 1.5 SW
3 BG1 1.5 2.6 HW
4 ZH4 0 0.3 HW
5 ZH4 0.3 2.6 SW
I want the information from 'Lith' in dfA as a percentage of overlap (from 'to') in dfB. The result should look like this:
dfC
" ID from to Weath GRN GDI SED
1 BG1 0 0.8 HW 0.25 0.125 0.625
2 BG1 0.8 1.5 SW 1 0 0
3 BG1 1.5 2.6 HW 1 0 0
4 ZH4 0 0.3 HW 1 0 0
5 ZH4 0.3 2.6 SW 0.1739 0.8261 0
Note that intervals of dfA do not coincide with those of dfB and that overlaps shall only be checked for same ID. Note also that up to three overlaps in one interval of dfB are possible. Intervals of dfA are always bigger than dfB.
My attempts so far have resulted in dead ends. Splitting the df by ID is not an option since the original amount of data is immense.
Upvotes: 2
Views: 142
Reputation: 92300
Here's a possible foverlaps
solution
library(data.table)
setkey(setDT(dfA), ID, from, to)
setkey(setDT(dfB), ID, from, to)
res <- foverlaps(dfA, dfB)[, overlap := (pmin(to, i.to) - pmax(from, i.from))/(to - from)]
dcast(res, ID + from + to + Weath ~ Lith, value.var = "overlap", fill = 0)
# ID from to Weath GDI GRN SED
# 1: BG1 0.0 0.8 HW 0.125000 0.250000 0.625
# 2: BG1 0.8 1.5 SW 0.000000 1.000000 0.000
# 3: BG1 1.5 2.6 HW 0.000000 1.000000 0.000
# 4: ZH4 0.0 0.3 HW 0.000000 1.000000 0.000
# 5: ZH4 0.3 2.6 SW 0.826087 0.173913 0.000
key
by ID
and the intervals (nessacery in order for foverlpas
to know on which column to operate)foverlaps
function in order to identify overlapsoverlap
variable according to your rulesdcast
the result according to the columns of interest Upvotes: 3
Reputation: 26446
Merge the tables, perform your overlap function, reshape as desired.
library(reshape2)
m<-merge(dfB,dfA,by="ID",suffixes=c("",".y"))
overlap<-function(L1,R1,L2,R2) pmax(0,pmin(R1,R2)-pmax(L1,L2))
m$value<-overlap(m$from,m$to,m$from.y,m$to.y)/(m$to-m$from)
dcast(m,ID+from+to+Weath~Lith)
#> ID from to Weath GDI GRN SED
#> 1 BG1 0.0 0.8 HW 0.125000 0.250000 0.625
#> 2 BG1 0.8 1.5 SW 0.000000 1.000000 0.000
#> 3 BG1 1.5 2.6 HW 0.000000 1.000000 0.000
#> 4 ZH4 0.0 0.3 HW 0.000000 1.000000 NA
#> 5 ZH4 0.3 2.6 SW 0.826087 0.173913 NA
Upvotes: 1
Reputation: 44340
I would process each value of Lith (GRN, GDI, SED) one at a time, adding the generated column to dfC
. For each value of Lith, I would first find the row of dfA
corresponding to each row of dfB
using the match
function (this is the vector of row indices r
in the get.col
function below). Then I would calculate the normalized overlap in a vectorized manner using pmax
and pmin
(which is important since you say you have a large dataset).
get.col <- function(lith) {
r <- match(paste(dfB$ID, lith), paste(dfA$ID, dfA$Lith))
out <- pmax(0, pmin(dfA$to[r], dfB$to) - pmax(dfA$from[r], dfB$from)) / # Overlap
(dfB$to - dfB$from) # Size of interval in dfB
out[is.na(out)] <- 0 # Unmatched rows have no overlap
out
}
dfC <- dfB
for (lith in unique(dfA$Lith)) {
dfC[,lith] <- get.col(lith)
}
dfC
# ID from to Weath SED GDI GRN
# 1 BG1 0.0 0.8 HW 0.625 0.125000 0.250000
# 2 BG1 0.8 1.5 SW 0.000 0.000000 1.000000
# 3 BG1 1.5 2.6 HW 0.000 0.000000 1.000000
# 4 ZH4 0.0 0.3 HW 0.000 0.000000 1.000000
# 5 ZH4 0.3 2.6 SW 0.000 0.826087 0.173913
Upvotes: 1