Jadeaffe
Jadeaffe

Reputation: 23

Including information from overlapping intervals into data.frame

I have two data frames:

dfA
"   ID  from    to  Lith
1   BG1 0       0.5 SED
2   BG1 0.5     0.6 GDI
3   BG1 0.6     2.8 GRN
3   ZH4 0       0.7 GRN
4   ZH4 0.7     3.0 GDI

dfB
"   ID  from    to  Weath
1   BG1 0       0.8 HW
2   BG1 0.8     1.5 SW
3   BG1 1.5     2.6 HW
4   ZH4 0       0.3 HW
5   ZH4 0.3     2.6 SW

I want the information from 'Lith' in dfA as a percentage of overlap (from 'to') in dfB. The result should look like this:

dfC
"   ID  from    to  Weath   GRN     GDI     SED
1   BG1 0       0.8 HW      0.25    0.125   0.625
2   BG1 0.8     1.5 SW      1       0       0
3   BG1 1.5     2.6 HW      1       0       0
4   ZH4 0       0.3 HW      1       0       0
5   ZH4 0.3     2.6 SW      0.1739  0.8261  0

Note that intervals of dfA do not coincide with those of dfB and that overlaps shall only be checked for same ID. Note also that up to three overlaps in one interval of dfB are possible. Intervals of dfA are always bigger than dfB.

My attempts so far have resulted in dead ends. Splitting the df by ID is not an option since the original amount of data is immense.

Upvotes: 2

Views: 142

Answers (3)

David Arenburg
David Arenburg

Reputation: 92300

Here's a possible foverlaps solution

library(data.table)
setkey(setDT(dfA), ID, from, to)
setkey(setDT(dfB), ID, from, to)
res <- foverlaps(dfA, dfB)[, overlap := (pmin(to, i.to) - pmax(from, i.from))/(to - from)]
dcast(res, ID + from + to + Weath ~ Lith, value.var = "overlap", fill = 0)
#     ID from  to Weath      GDI      GRN   SED
# 1: BG1  0.0 0.8    HW 0.125000 0.250000 0.625
# 2: BG1  0.8 1.5    SW 0.000000 1.000000 0.000
# 3: BG1  1.5 2.6    HW 0.000000 1.000000 0.000
# 4: ZH4  0.0 0.3    HW 0.000000 1.000000 0.000
# 5: ZH4  0.3 2.6    SW 0.826087 0.173913 0.000

  • key by ID and the intervals (nessacery in order for foverlpas to know on which column to operate)
  • Run the foverlaps function in order to identify overlaps
  • Define the overlap variable according to your rules
  • Finally, dcast the result according to the columns of interest

Upvotes: 3

A. Webb
A. Webb

Reputation: 26446

Merge the tables, perform your overlap function, reshape as desired.

library(reshape2)
m<-merge(dfB,dfA,by="ID",suffixes=c("",".y"))
overlap<-function(L1,R1,L2,R2) pmax(0,pmin(R1,R2)-pmax(L1,L2))
m$value<-overlap(m$from,m$to,m$from.y,m$to.y)/(m$to-m$from)
dcast(m,ID+from+to+Weath~Lith)

#>    ID from  to Weath      GDI      GRN   SED
#> 1 BG1  0.0 0.8    HW 0.125000 0.250000 0.625
#> 2 BG1  0.8 1.5    SW 0.000000 1.000000 0.000
#> 3 BG1  1.5 2.6    HW 0.000000 1.000000 0.000
#> 4 ZH4  0.0 0.3    HW 0.000000 1.000000    NA
#> 5 ZH4  0.3 2.6    SW 0.826087 0.173913    NA

Upvotes: 1

josliber
josliber

Reputation: 44340

I would process each value of Lith (GRN, GDI, SED) one at a time, adding the generated column to dfC. For each value of Lith, I would first find the row of dfA corresponding to each row of dfB using the match function (this is the vector of row indices r in the get.col function below). Then I would calculate the normalized overlap in a vectorized manner using pmax and pmin (which is important since you say you have a large dataset).

get.col <- function(lith) {
  r <- match(paste(dfB$ID, lith), paste(dfA$ID, dfA$Lith))
  out <- pmax(0, pmin(dfA$to[r], dfB$to) - pmax(dfA$from[r], dfB$from)) /  # Overlap
    (dfB$to - dfB$from)  # Size of interval in dfB
  out[is.na(out)] <- 0  # Unmatched rows have no overlap
  out
}

dfC <- dfB
for (lith in unique(dfA$Lith)) {
  dfC[,lith] <- get.col(lith)
}
dfC
#    ID from  to Weath   SED      GDI      GRN
# 1 BG1  0.0 0.8    HW 0.625 0.125000 0.250000
# 2 BG1  0.8 1.5    SW 0.000 0.000000 1.000000
# 3 BG1  1.5 2.6    HW 0.000 0.000000 1.000000
# 4 ZH4  0.0 0.3    HW 0.000 0.000000 1.000000
# 5 ZH4  0.3 2.6    SW 0.000 0.826087 0.173913

Upvotes: 1

Related Questions