Reputation: 1
I have the following data frame with multiple observations:
CHR END START REF ALT
1 1445 1446 G A
1 1445 1446 A G
3 2787 2787 T -
3 2787 2787 - T
I want to delete a row if its REF column is - and its ALT column matches the REF column of another row, while the other columns are equal.
In my example, this is the desired output:
CHR END START REF ALT
1 1445 1446 G A
1 1445 1446 A G
3 2787 2787 T -
I'm not sure how to relate the indices of different rows.
In the data frame, the rows to delete always follow their "mother" row.
Upvotes: 0
Views: 73
Reputation: 17678
You can try:
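library(tidyverse)
d %>%
  # build a combined key from REF and ALT, keeping the original columns
  unite(tmp, REF, ALT, remove = FALSE) %>%
  # sort the two alleles so that T/- and -/T produce the same key
  mutate(tmp = strsplit(tmp, "_") %>% map_chr(function(x) paste(sort(x), collapse = "_"))) %>%
  group_by(CHR, END, START, tmp) %>%
  # number rows within a group only when the key involves "-"; other rows get 1
  mutate(n = ifelse(grepl("-", tmp), 1:n(), 1)) %>%
  # keep the first row of each "-" group and every row without "-"
  filter(n == 1) %>%
  ungroup() %>%
  select(-tmp, -n)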
# A tibble: 3 x 5
CHR END START REF ALT
<int> <int> <int> <fct> <fct>
1 1 1445 1446 G A
2 1 1445 1446 A G
3 3 2787 2787 T -
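The idea is to add an identifier tmp holding the sorted REF and ALT values, built with strsplit and map_chr. Rows that share CHR, END, START and tmp are then numbered within their group, and in groups whose key contains - only the first row is kept. Pairs such as G/A and A/G contain no -, so both rows survive.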
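The same idea can be sketched in base R without the tidyverse. This is only an illustration under assumptions: it swaps the strsplit/map step for pmin/pmax on character vectors, and the names key, has_gap, dup and result are made up for this example. The sample rows are retyped from the question.

```r
# Sample rows from the question; REF and ALT are kept as character
d <- data.frame(
  CHR   = c(1, 1, 3, 3),
  END   = c(1445, 1445, 2787, 2787),
  START = c(1446, 1446, 2787, 2787),
  REF   = c("G", "A", "T", "-"),
  ALT   = c("A", "G", "-", "T"),
  stringsAsFactors = FALSE
)

# Order-independent key: the two alleles pasted in sorted order,
# so T/- and -/T end up with the same key
key <- paste(pmin(d$REF, d$ALT), pmax(d$REF, d$ALT), sep = "_")

# A row is dropped only if it involves "-" and repeats the position
# and key of an earlier row; G/A vs A/G is kept because it has no "-"
has_gap <- d$REF == "-" | d$ALT == "-"
dup     <- duplicated(data.frame(d$CHR, d$END, d$START, key))
result  <- d[!(has_gap & dup), ]
```

Like the pipeline above, this keeps the first row of each - pair, which relies on the question's statement that the row to delete always follows its "mother" row.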
The data
d <- read.table(text=" CHR END START REF ALT
1 1445 1446 G A
1 1445 1446 A G
3 2787 2787 T -
3 2787 2787 - T", header=TRUE)
Upvotes: 1