Reputation: 3
Scenario:
2-column dataframe_1 (300,000 rows)
head(dataframe_1):
CHR POS
1 2000
1 3000
2 1500
3 3000
3-column dataframe_2 (300 rows)
head(dataframe_2):
CHR POS_START POS_END
1 1500 2500
1 3200 4000
2 1200 1600
2 2000 2200
3 5000 5500
4 1000 1200
The goal is to take dataframe_1 and compare the POS column of each row against dataframe_2 (columns POS_START and POS_END) and return a vector (length = nrow(dataframe_1)) that indicates which row of dataframe_1 lists a POS value that is within the range as indicated in dataframe_2. Note that each POS value is linked to a particular CHR value.
Example return vector:
CHR POS EXAMPLE_RETURN_VECTOR
1 2000 TRUE
1 3000 FALSE
2 1500 TRUE
3 3000 FALSE
What’s the best strategy here?
Thanks!
Upvotes: 0
Views: 203
Reputation: 540
Can we assume that each CHR
value only appears once? Can we also assume that each CHR
value that occurs in dataframe_1
is also in dataframe_2
and vice versa? Can we also assume that both tables are sorted by the CHR
value? If so, you can just create a new column like this:
dataframe_1['NEW_COLUMN'] = dataframe_1.POS >= dataframe_2.POS_START & dataframe_1.POS <= dataframe_2.POS_END
If you can't make all of those assumptions, then you could either:
1) Merge both tables where they have matching CHR
values, filter out rows with missing data, and then create the new column, or
2) Loop through each row of dataframe_1
, and for each row, compare the POS
value to the value of dataframe_2[dataframe_2.POS_START == dataframe_1.POS_START]
and dataframe_2[dataframe_2.POS_END == dataframe_1.POS_END]
Upvotes: 0
Reputation: 764
Here's an answer with dplyr
:
library(dplyr)
df1 <- tribble(
~CHR, ~POS,
1, 2000,
1, 3000,
2, 1500,
3, 3000
)
df2 <- tribble(
~CHR, ~POS_START, ~POS_END,
1, 1500, 2500,
1, 3200, 4000,
2, 1200, 1600,
2, 2000, 2200,
3, 5000, 5500,
4, 1000, 1200
)
df1 %>%
left_join(df2, by = 'CHR') %>%
mutate(IN_RANGE = POS >= POS_START & POS <= POS_END) %>%
group_by(CHR, POS) %>%
summarize(IN_RANGE = sum(IN_RANGE) > 0)
Upvotes: 1