Reputation: 311
I have a data frame:
name<-c('a','b','c','d','e')
type<-c('x','x','y','x','y')
chr<- c('ch1','ch1','ch1','ch2','ch2')
pos<- c(5000, 5100, 4999,5500,5100)
df<-data.frame(name,type, chr,pos)
I would like to go through each row and if type is NOT equal, AND chr IS equal, AND pos is within abs(100) then create a new df with the matches (with a new column of the name of the match).
For the above df rows 1 and 3 would match and the result would be
Ideally I don't want the reciprocal match and so I would want
If it is easier I can split into two dfs according to type.
I have tried a variation of merge and filter (dplyr) but am getting nowhere.
Upvotes: 0
Views: 896
Reputation: 25225
An data.table
option using a non-equi join, which should be faster for a large dataset:
library(data.table)
setDT(df)[, c("s", "e") := .(pos - 100, pos + 100)]
#perform non-equi join based on desired conditions
pair <- df[df, on=.(chr, s<=pos, e>=pos), nomatch=0L,
.(name=i.name[x.type!=i.type], match=x.name[x.type!=i.type])]
#extract rows with matches while removing reciprocals
df[unique(pair[, .(name=pmin(name, match), match=pmax(name, match))]), on=.(name)]
output:
name type chr pos s e match
1: a x ch1 5000 4900 5100 c
Upvotes: 0
Reputation: 1263
I'm sure others here will come up with something more elegant than row slicing, but this seems to work if you a full data frame of all the matches:
library(tidyverse)
find_matches <- function(i) {
row_of_interest <- df[i, ]
df_rest <- df[-i, ]
names(df_rest) <- str_c(names(df_rest), ".x")
df_rest %>%
cbind(row_of_interest) %>%
filter(type != type.x, abs(pos - pos.x) < 100) %>%
transmute(name, type, chr, pos, match = name.x)
}
map_dfr(1:5, find_matches)
name type chr pos match
1 a x ch1 5000 c
2 b x ch1 5100 e
3 c y ch1 4999 a
4 e y ch2 5100 b
Upvotes: 1
Reputation: 388862
We can split the data based on type
column, do a full_join
by
'chr'
column and filter
the rows where absolute values between pos
columns is less than 100.
library(dplyr)
df %>%
group_split(type) %>%
purrr::reduce(full_join, by = 'chr') %>%
filter(abs(pos.x - pos.y) < 100)
# A tibble: 1 x 7
# name.x type.x chr pos.x name.y type.y pos.y
# <fct> <fct> <fct> <dbl> <fct> <fct> <dbl>
#1 a x ch1 5000 c y 4999
You can then remove any unwanted columns and rename them as per your requirement.
Upvotes: 4