Comparing one value across multiple rows in one data frame with values across multiple rows in a second data frame

Scenario:

2-column dataframe_1 (300,000 rows)

head(dataframe_1):

CHR POS

1 2000

1 3000

2 1500

3 3000

3-column dataframe_2 (300 rows)

head(dataframe_2):

CHR POS_START POS_END

1 1500 2500

1 3200 4000

2 1200 1600

2 2000 2200

3 5000 5500

4 1000 1200

The goal is to take dataframe_1 and compare the POS column of each row against dataframe_2 (columns POS_START and POS_END) and return a vector (length = nrow(dataframe_1)) that indicates which row of dataframe_1 lists a POS value that is within the range as indicated in dataframe_2. Note that each POS value is linked to a particular CHR value.

Example return vector:

CHR POS EXAMPLE_RETURN_VECTOR

1 2000 TRUE

1 3000 FALSE

2 1500 TRUE

3 3000 FALSE

What’s the best strategy here?

Thanks!

Upvotes: 0

Views: 203

Answers (2)

Tim Johns
Tim Johns

Reputation: 540

Can we assume that each CHR value only appears once? Can we also assume that each CHR value that occurs in dataframe_1 is also in dataframe_2 and vice versa? Can we also assume that both tables are sorted by the CHR value? If so, you can just create a new column like this:

dataframe_1['NEW_COLUMN'] = dataframe_1.POS >= dataframe_2.POS_START & dataframe_1.POS <= dataframe_2.POS_END

If you can't make all of those assumptions, then you could either:

1) Merge both tables where they have matching CHR values, filter out rows with missing data, and then create the new column, or

2) Loop through each row of dataframe_1, and for each row, compare the POS value to the value of dataframe_2[dataframe_2.POS_START == dataframe_1.POS_START] and dataframe_2[dataframe_2.POS_END == dataframe_1.POS_END]

Upvotes: 0

TBT8
TBT8

Reputation: 764

Here's an answer with dplyr:

library(dplyr)

df1 <- tribble(
     ~CHR, ~POS,
     1,  2000,                  
     1,  3000,
     2,  1500,
     3,  3000
)

df2 <- tribble(
     ~CHR, ~POS_START, ~POS_END,
     1, 1500, 2500,                  
     1, 3200, 4000,
     2, 1200, 1600,
     2, 2000, 2200,
     3, 5000, 5500,
     4, 1000, 1200
)

df1 %>% 
     left_join(df2, by = 'CHR') %>% 
     mutate(IN_RANGE = POS >= POS_START & POS <= POS_END) %>% 
     group_by(CHR, POS) %>% 
     summarize(IN_RANGE = sum(IN_RANGE) > 0)

Upvotes: 1

Related Questions