Lynsey
Lynsey

Reputation: 357

Find nearest entry between dataframes in R

I have two dataframes that I want to compare. I already know that the values in dataframe one (df1$BP) are not within the range of values in dataframe two (df2$START and df2$STOP), but I want to return the row in dataframe two where df2$START or df2$STOP is closest in value to df1$BP, where the "Chr" column matches between datasets (df1$Chr, df2$Chr).

I have managed to do this (see bottom of Q), but it seems SERIOUSLY unwieldy, and I wondered if there was a more parsimonious manner of achieving the same thing.

So for dataframe one (df1), we have:

 df1=data.frame(SNP=c("rs79247094","rs13325007"),
           Chr=c(2,3),
           BP=c(48554955,107916058))

df1

SNP         Chr    BP
rs79247094    2     48554955
rs13325007    3    107916058

For dataframe two, we have:

 df2=data.frame(clump=c(1,2,3,4,5,6,7,8),
           Chr=c(2,2,2,2,3,3,3,3),
           START=c(28033538,37576136,58143438,60389362,80814042,107379837,136288405,161777035),
           STOP=c(27451538,36998607,57845065,60242162,79814042,107118837,135530405,161092491))

df2

 Clump    Chr      START       STOP
     1      2   28033538   27451538
     2      2   37576136   36998607
     3      2   58143438   57845065
     4      2   60389362   60242162
     5      3   80814042   79814042
     6      3  107379837  107118837
     7      3  136288405  135530405
     8      3  161777035  161092491

I am interested in returning which START/STOP values are closest to BP. Ideally, I could return the row, and what the difference between BP and START or STOP is (df3$Dist), like:

df3

 Clump   Chr      START       STOP         SNP        BP       Dist
     3     2   58143438   57845065  rs79247094  48554955    9290110
     6     3  107379837  107118837  rs13325007  107916058    536221

I have found similar questions, for example: Return rows establishing a "closest value to" in R

But these are finding the closest values based on a fixed value, rather than a value that changes (and matching on the Chr column).

My long winded method is:

df3<-right_join(df1,df2,by="Chr")

to give me all of the combinations of df1 and df2 together.

df3$start_dist<-abs(df3$START-df3$BP)

to create a column with the absolute difference between START and BP

df3$stop_dist<-abs(df3$STOP-df3$BP)

to create a column with the absolute difference between STOP and BP

df3$dist.compare<-ifelse(df3$start_dist<df3$stop_dist,df3$start_dist,df3$stop_dist)
df3<-df3[with(df3,order(SNP,"dist.compare")),]

to create a column (dist.compare) which prints the smallest difference between BP and START or STOP (and the re-order by that column)

df3<- df3 %>%   group_by(SNP) %>%   mutate(Dist = first(dist.compare))

to create a column (Dist) which prints the minimum value from df3$dist.compare

df3<-df3[which(df3$dist.compare==df3$Dist),c("clump","Chr","START","STOP","SNP","BP","Dist")]
df3<-df3[order(df3$clump),]

to only print rows where dist.compare matches Dist (so the minimum value), and drop the intermediate columns, and tidy up by re-ordering by clump. Now that gets me to where I want to be:

df3

 Clump   Chr      START       STOP         SNP        BP       Dist
     3     2   58143438   57845065  rs79247094  48554955    9290110
     6     3  107379837  107118837  rs13325007  107916058    536221

But I feel like it is very very convoluted, and wondered if anyone had any tips on how to refine that process?

thanks in advance

Upvotes: 1

Views: 818

Answers (1)

Matt
Matt

Reputation: 7405

Following the logic you've laid out in your syntax, here is a dplyr solution that is a bit cleaner:

  1. right_join your dataframes

  2. Create a variable dist.compare based on absolute values

  3. Group by SNP

  4. Filter to keep the smallest distance

  5. Select variables in the order you'd like for your final dataframe. Note you can rename variables in a dplyr::select statement (Dist = dist.compare)

  6. Order values by clump

     library(dplyr)
    
     df3 <- right_join(df1, df2, by = "Chr") %>% 
       mutate(dist.compare = ifelse(abs(START - BP) < abs(STOP - BP), abs(START - BP), abs(STOP - BP))) %>% 
       group_by(SNP) %>% 
       filter(dist.compare == min(dist.compare)) %>% 
       select(clump, Chr, START, STOP, SNP, BP, Dist = dist.compare) %>% 
       arrange(clump)
    

This gives us:

  clump   Chr     START      STOP SNP               BP    Dist
  <dbl> <dbl>     <dbl>     <dbl> <chr>          <dbl>   <dbl>
1     3     2  58143438  57845065 rs79247094  48554955 9290110
2     6     3 107379837 107118837 rs13325007 107916058  536221

Upvotes: 2

Related Questions