Comparing one value across multiple rows in one data frame with values across multiple rows in a second data frame

Question

Scenario:

2-column dataframe_1 (300,000 rows)

head(dataframe_1):

CHR POS

1 2000

1 3000

2 1500

3 3000

3-column dataframe_2 (300 rows)

head(dataframe_2):

CHR POS_START POS_END

1 1500 2500

1 3200 4000

2 1200 1600

2 2000 2200

3 5000 5500

4 1000 1200

The goal is to take dataframe_1 and compare the POS column of each row against dataframe_2 (columns POS_START and POS_END) and return a vector (length = nrow(dataframe_1)) that indicates which row of dataframe_1 lists a POS value that is within the range as indicated in dataframe_2. Note that each POS value is linked to a particular CHR value.

Example return vector:

CHR POS EXAMPLE_RETURN_VECTOR

1 2000 TRUE

1 3000 FALSE

2 1500 TRUE

3 3000 FALSE

What’s the best strategy here?

Thanks!

TBT8 · Accepted Answer

Here's an answer with dplyr:

library(dplyr)

df1 <- tribble(
     ~CHR, ~POS,
     1,  2000,                  
     1,  3000,
     2,  1500,
     3,  3000
)

df2 <- tribble(
     ~CHR, ~POS_START, ~POS_END,
     1, 1500, 2500,                  
     1, 3200, 4000,
     2, 1200, 1600,
     2, 2000, 2200,
     3, 5000, 5500,
     4, 1000, 1200
)

df1 %>% 
     left_join(df2, by = 'CHR') %>% 
     mutate(IN_RANGE = POS >= POS_START & POS <= POS_END) %>% 
     group_by(CHR, POS) %>% 
     summarize(IN_RANGE = sum(IN_RANGE) > 0)

Comparing one value across multiple rows in one data frame with values across multiple rows in a second data frame

Answers (2)

Related Questions