Popika
Popika

Reputation: 23

Mapping 2 unrelated data frames in R

I need to use data from a dataframe A to fill a column in my dataframe B.

Here is a subset of dataframe A:

> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
  Family Count Start End
1      A     1     0  10
2      A     2    10  35
3      A     3    35  50
4      B     1     0   5
5      B     2     5  25

and a subset of dataframe B

> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
  Family Start End BelongToCount
1      A     1   3            NA
2      A     4   6            NA
3      A    36  40            NA
4      B     2   4            NA
5      B    10  24            NA

What I want to do is to fill in the BelongToCount column in B according to the data from dataframe A, which would end up with dataframe B filled as:

Family Start End  BelongToCount
A      1     3    1
A      4     6    1
A      36    40   3
B      2     4    1
B      10    24   2

I need to do this for each family (so grouping by family), and the condition to fill the BelongToCount column is that if B$Start >= A$Start && B$End <= A$End.

I can't seem to find a clean (and fast) way to do this in R. Right now, I am doing as follows:

split_A <- split(dfA, dfA$Family) 
split_A_FamilyA <- split_A[["A"]]

split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]

for(i in 1:nrow(split_B_FamilyA)) {
    row <- split_B_FamilyA[i,]
    start <- row$dStart
    end <- row$dEnd
    for(j in 1:nrow(split_A_FamilyA)) {
      row_base <- split_A_FamilyA[j,]
      start_base <- row_base$Start
      end_base <- row_base$End
      if ((start >= start_base) && (end <= end_base)) {
        split_B_FamilyA[i,][i,]$BelongToCount <- row_base$Count
        break
      }
    }
}

I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when it comes to applying operations on specific groups, but I can't find a way to do such a thing using it. Joining the tables does not make a lot of sense either because the number of rows don't match.

Can someone point me any relevant R function / an efficient way of solving this problem in R?

Upvotes: 1

Views: 85

Answers (1)

mt1022
mt1022

Reputation: 17309

You can do this with non-equi join in data.table:

library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
#    Family Start End BelongToCount
# 1:      A     1   3             1
# 2:      A     4   6             1
# 3:      A    36  40             3
# 4:      B     2   4             1
# 5:      B    10  24             2

In case a row in dfB is contained in multiple roles of dfA:

dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
    on = .(Family, Start <= Start, End >= End), by = .EACHI]
#    Family Start End BelongToCount
# 1:      A     1   3             2
# 2:      A     4   6             2
# 3:      A    36  40             6
# 4:      B     2   4             2
# 5:      B    10  24             4

Upvotes: 1

Related Questions