Filter for genes that contain significant SNPs

Question

I have two data frames: a list of SNPs from GWAS output, and a list of gene start/end coordinates. I want to filter (using the dplyr package) to keep only those genes that have a SNP whose Position falls within their Start/End boundaries.

I imagine %in% might be the right way to go here, but I'm struggling with the fact that the gene coordinates are a range of values. Thus I can't just look for rows where the SNP position matches a gene position.

I've seen solutions using the biomaRt package and others, but I'm looking for a dplyr solution. Thanks in advance.

Gene dataframe:

Gene   Start   End
gene1  1       5
gene2  10      15
gene3  20      25
gene4  30      35

SNP dataframe:

Position    SNP_ID
6           ss1
8           ss2
9           ss3
11          ss4
16          ss5
19          ss6
27          ss7
34          ss8

Desired output:

Gene   Start   End
gene2  10      15
gene4  30      35

Artem Sokolov · Accepted Answer

The task is to identify genes that contain at least one SNP. We can do this by traversing the pairs of Start and End positions with map2_lgl and asking whether any of the SNP positions land between them:

library( tidyverse )

dfg %>% mutate( AnyHits = map2_lgl(Start, End, ~any(dfs$Position %in% seq(.x,.y))) )
# # A tibble: 4 x 4
#   Gene  Start   End AnyHits
#   <chr> <dbl> <dbl> <lgl>
# 1 gene1     1     5 FALSE  
# 2 gene2    10    15 TRUE   
# 3 gene3    20    25 FALSE  
# 4 gene4    30    35 TRUE

From here, it's just a simple %>% filter(AnyHits) to reduce your data frame to the rows that had at least one SNP hit.
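Put together, the whole pipeline might look like the sketch below. It rebuilds the example data so that it runs on its own. One deliberate difference from the code above: it uses dplyr's between() rather than %in% seq(.x, .y). That swap is my assumption about what is convenient here, not part of the original answer. It avoids materialising every position in each gene, and it also works if positions are not whole numbers.

```r
library(dplyr)
library(purrr)

# Example data, same values as in the Data section of this answer
dfg <- tibble(Gene  = paste0("gene", 1:4),
              Start = c(1, 10, 20, 30),
              End   = c(5, 15, 25, 35))
dfs <- tibble(Position = c(6, 8, 9, 11, 16, 19, 27, 34),
              SNP_ID   = paste0("ss", 1:8))

hits <- dfg %>%
  # For each gene, test whether any SNP position lies in [Start, End]
  mutate(AnyHits = map2_lgl(Start, End,
                            ~ any(between(dfs$Position, .x, .y)))) %>%
  # Keep only genes with at least one SNP, then drop the helper column
  filter(AnyHits) %>%
  select(-AnyHits)

print(hits)
```

With the example data this should leave only gene2 and gene4, matching the desired output in the question.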

Data:

# Genes
dfg <- tibble( Gene = str_c("gene",1:4),
               Start = c(1,10,20,30),
               End = c(5,15,25,35) )

# SNPs
dfs <- tibble( Position = c(6,8,9,11,16,19,27,34),
               SNP_ID = str_c("ss",1:8) )
