Reputation: 357
I have two dataframes that I want to compare. I already know that the values in dataframe one (df1$BP) are not within the range of values in dataframe two (df2$START and df2$STOP), but I want to return the row in dataframe two where df2$START or df2$STOP is closest in value to df1$BP, where the "Chr" column matches between datasets (df1$Chr, df2$Chr).
I have managed to do this (see bottom of Q), but it seems SERIOUSLY unwieldy, and I wondered if there was a more parsimonious manner of achieving the same thing.
So for dataframe one (df1), we have:
df1=data.frame(SNP=c("rs79247094","rs13325007"),
Chr=c(2,3),
BP=c(48554955,107916058))
df1
SNP Chr BP
rs79247094 2 48554955
rs13325007 3 107916058
For dataframe two, we have:
df2=data.frame(clump=c(1,2,3,4,5,6,7,8),
Chr=c(2,2,2,2,3,3,3,3),
START=c(28033538,37576136,58143438,60389362,80814042,107379837,136288405,161777035),
STOP=c(27451538,36998607,57845065,60242162,79814042,107118837,135530405,161092491))
df2
Clump Chr START STOP
1 2 28033538 27451538
2 2 37576136 36998607
3 2 58143438 57845065
4 2 60389362 60242162
5 3 80814042 79814042
6 3 107379837 107118837
7 3 136288405 135530405
8 3 161777035 161092491
I am interested in returning which START/STOP values are closest to BP. Ideally, I could return the row, and what the difference between BP and START or STOP is (df3$Dist), like:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
I have found similar questions, for example: Return rows establishing a "closest value to" in R
But these are finding the closest values based on a fixed value, rather than a value that changes (and matching on the Chr column).
My long winded method is:
df3<-right_join(df1,df2,by="Chr")
to give me all of the combinations of df1 and df2 together.
df3$start_dist<-abs(df3$START-df3$BP)
to create a column with the absolute difference between START and BP
df3$stop_dist<-abs(df3$STOP-df3$BP)
to create a column with the absolute difference between STOP and BP
df3$dist.compare<-ifelse(df3$start_dist<df3$stop_dist,df3$start_dist,df3$stop_dist)
df3<-df3[with(df3,order(SNP,"dist.compare")),]
to create a column (dist.compare) which prints the smallest difference between BP and START or STOP (and the re-order by that column)
df3<- df3 %>% group_by(SNP) %>% mutate(Dist = first(dist.compare))
to create a column (Dist) which prints the minimum value from df3$dist.compare
df3<-df3[which(df3$dist.compare==df3$Dist),c("clump","Chr","START","STOP","SNP","BP","Dist")]
df3<-df3[order(df3$clump),]
to only print rows where dist.compare matches Dist (so the minimum value), and drop the intermediate columns, and tidy up by re-ordering by clump. Now that gets me to where I want to be:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
But I feel like it is very very convoluted, and wondered if anyone had any tips on how to refine that process?
thanks in advance
Upvotes: 1
Views: 818
Reputation: 7405
Following the logic you've laid out in your syntax, here is a dplyr
solution that is a bit cleaner:
right_join
your dataframes
Create a variable dist.compare
based on absolute values
Group by SNP
Filter to keep the smallest distance
Select variables in the order you'd like for your final dataframe. Note you can rename variables in a dplyr::select
statement (Dist = dist.compare
)
Order values by clump
library(dplyr)
df3 <- right_join(df1, df2, by = "Chr") %>%
mutate(dist.compare = ifelse(abs(START - BP) < abs(STOP - BP), abs(START - BP), abs(STOP - BP))) %>%
group_by(SNP) %>%
filter(dist.compare == min(dist.compare)) %>%
select(clump, Chr, START, STOP, SNP, BP, Dist = dist.compare) %>%
arrange(clump)
This gives us:
clump Chr START STOP SNP BP Dist
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 3 2 58143438 57845065 rs79247094 48554955 9290110
2 6 3 107379837 107118837 rs13325007 107916058 536221
Upvotes: 2