Reputation: 97
I have two data frames: a list of SNPs from GWAS output, and a list of gene start/end coordinates. I want to filter (using the dplyr package) to extract only those genes that have a SNP whose Position falls within their start/end boundaries.
I imagine %in% might be the right way to go here, but I'm struggling with the fact that the gene coordinates are a range of values, so I can't just look for rows where the SNP position matches a gene position.
I've seen solutions using the biomaRt package and others, but I'm looking for a dplyr solution. Thanks in advance.
Gene dataframe:
Gene Start End
gene1 1 5
gene2 10 15
gene3 20 25
gene4 30 35
SNP dataframe:
Position SNP_ID
6 ss1
8 ss2
9 ss3
11 ss4
16 ss5
19 ss6
27 ss7
34 ss8
Desired output:
Gene Start End
gene2 10 15
gene4 30 35
Upvotes: 0
Views: 217
Reputation: 13691
The task is to identify genes that have at least one SNP in them. We can do this by traversing the pairs of Start and End positions with map2_lgl and asking whether any of the SNP positions land between them:
library( tidyverse )
dfg %>% mutate( AnyHits = map2_lgl(Start, End, ~any(dfs$Position %in% seq(.x,.y))) )
# # A tibble: 4 x 4
# Gene Start End AnyHits
# <chr> <dbl> <dbl> <lgl>
# 1 gene1 1 5 FALSE
# 2 gene2 10 15 TRUE
# 3 gene3 20 25 FALSE
# 4 gene4 30 35 TRUE
From here, it's just a simple %>% filter(AnyHits) to reduce your data frame to the rows that had at least one SNP hit.
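One caveat: %in% seq(.x, .y) only matches whole-number positions and builds the full sequence for every gene, which gets expensive for long genes. A minimal sketch of a variant that compares against the boundaries directly and filters in one step is below. It assumes the same dfg and dfs data frames as in this answer (rebuilt here so the snippet runs on its own); the name hits is just illustrative.

```r
library(dplyr)
library(purrr)

# Example data copied from the question (tibble() is re-exported by dplyr)
dfg <- tibble(Gene  = paste0("gene", 1:4),
              Start = c(1, 10, 20, 30),
              End   = c(5, 15, 25, 35))
dfs <- tibble(Position = c(6, 8, 9, 11, 16, 19, 27, 34),
              SNP_ID   = paste0("ss", 1:8))

# Keep only genes with at least one SNP position inside [Start, End].
# The comparison is inclusive on both ends, like seq(Start, End).
hits <- dfg %>%
  filter(map2_lgl(Start, End,
                  ~ any(dfs$Position >= .x & dfs$Position <= .y)))

print(hits)
```

On the example data this keeps gene2 (SNP at 11) and gene4 (SNP at 34). Real GWAS data usually spans several chromosomes, in which case a chromosome match would have to be added to the condition; that column isn't in the example, so it is left out here.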
Data:
# Genes
dfg <- tibble( Gene = str_c("gene",1:4),
               Start = c(1,10,20,30),
               End = c(5,15,25,35) )
# SNPs
dfs <- tibble( Position = c(6,8,9,11,16,19,27,34),
               SNP_ID = str_c("ss",1:8) )
Upvotes: 2