zoe
zoe

Reputation: 311

compare multiple columns in one or two data frames r

I have a data frame:

name<-c('a','b','c','d','e')
type<-c('x','x','y','x','y')
chr<- c('ch1','ch1','ch1','ch2','ch2')
pos<- c(5000, 5100, 4999,5500,5100)
df<-data.frame(name,type, chr,pos)

df
I would like to go through each row and if type is NOT equal, AND chr IS equal, AND pos is within abs(100) then create a new df with the matches (with a new column of the name of the match). For the above df rows 1 and 3 would match and the result would be

enter image description here

Ideally I don't want the reciprocal match and so I would want

enter image description here

If it is easier I can split into two dfs according to type.

I have tried a variation of merge and filter (dplyr) but am getting nowhere.

Upvotes: 0

Views: 896

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

An data.table option using a non-equi join, which should be faster for a large dataset:

library(data.table)
setDT(df)[, c("s", "e") := .(pos - 100, pos + 100)]

#perform non-equi join based on desired conditions
pair <- df[df, on=.(chr, s<=pos, e>=pos), nomatch=0L,
    .(name=i.name[x.type!=i.type], match=x.name[x.type!=i.type])]

#extract rows with matches while removing reciprocals
df[unique(pair[, .(name=pmin(name, match), match=pmax(name, match))]), on=.(name)]

output:

   name type chr  pos    s    e match
1:    a    x ch1 5000 4900 5100     c

Upvotes: 0

cardinal40
cardinal40

Reputation: 1263

I'm sure others here will come up with something more elegant than row slicing, but this seems to work if you a full data frame of all the matches:

library(tidyverse)

find_matches <- function(i) {
  row_of_interest <- df[i, ]
  df_rest <- df[-i, ]
  names(df_rest) <- str_c(names(df_rest), ".x")

  df_rest %>% 
    cbind(row_of_interest) %>% 
    filter(type != type.x, abs(pos - pos.x) < 100) %>% 
    transmute(name, type, chr, pos, match = name.x)
}

map_dfr(1:5, find_matches)

  name type chr  pos match
1    a    x ch1 5000     c
2    b    x ch1 5100     e
3    c    y ch1 4999     a
4    e    y ch2 5100     b

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388862

We can split the data based on type column, do a full_join by 'chr' column and filter the rows where absolute values between pos columns is less than 100.

library(dplyr)

df %>%
  group_split(type) %>%
  purrr::reduce(full_join, by = 'chr') %>%
  filter(abs(pos.x - pos.y) < 100)

# A tibble: 1 x 7
#  name.x type.x chr   pos.x name.y type.y pos.y
#  <fct>  <fct>  <fct> <dbl> <fct>  <fct>  <dbl>
#1  a      x      ch1    5000 c      y       4999

You can then remove any unwanted columns and rename them as per your requirement.

Upvotes: 4

Related Questions