Reputation: 11
I have two datasets. One is a list of genes that shows the nucleotide positions of each gene.
For example:
Gene Name Low Position Upper Position
Gene 1 1000 2000
Gene 2 5000 6000
The other dataset is a list of polymorphisms and their nucleotide positions
For example:
Position Gene Location
SNP 1 3000 NA
SNP 2 5500 NA
I have used the ifelse function in R to sort my dataset of polymorphisms into their respective genes (so SNP 2 would have Gene 2 in its "Gene Location" column). The code I used was:
SampleGeneData$Gene.Name=as.character(SampleGeneData$Gene.Name)
SampleSNPData$Gene.Location=ifelse(sapply(SampleSNPData$Position,function(p) any(SampleGeneData$Low.Position<=p&SampleGeneData$High.Position>=p)),SampleGeneData$Gene.Name,"NO")
I was wondering if it was possible to also define the Gene Location as both of the genes that the SNP is located in between (so SNP 1 would have some output of "Gene 1 and Gene 2" or something similar). Could I do this with the ifelse function or would I have to use something else?
Upvotes: 0
Views: 32
Reputation: 17678
you can use the package valr
for that
library(tidyverse)
library(valr)
# data needs columns: chrom, start & end
genes <- read.table(text=" Gene start end
Gene1 1000 2000
Gene2 5000 6000", header=T) %>%
mutate(chrom =1)
snps <- tibble(rs=c("SNP1", "SNP2") ,start=c(3000,5500), end = c(3000,5500)+1,chrom=1)
Identify closest intervals:
valr::bed_closest(snps, genes, overlap=T)
# A tibble: 3 x 9
rs.x start.x end.x chrom Gene.y start.y end.y .overlap .dist
<chr> <dbl> <dbl> <dbl> <chr> <int> <int> <int> <int>
1 SNP1 3000 3001 1 Gene1 1000 2000 0 -1001
2 SNP2 5500 5501 1 Gene2 5000 6000 1 0
3 SNP2 5500 5501 1 Gene1 1000 2000 0 -3501
Then you can transform the data according your needs such as:
valr::bed_closest(snps, genes, overlap=T) %>%
group_by(rs=rs.x) %>%
# add a filter of maximal distance if needed
# filter(abs(.dist) <= 5000) %>%
summarise(Genes = toString(Gene.y)) %>%
right_join(snps,by = "rs")
# A tibble: 2 x 5
rs Genes start end chrom
<chr> <chr> <dbl> <dbl> <dbl>
1 SNP1 Gene1 3000 3001 1
2 SNP2 Gene2, Gene1 5500 5501 1
Only intersecting intervals via:
valr::bed_intersect(snps, genes)
# A tibble: 1 x 8
rs.x start.x end.x chrom Gene.y start.y end.y .overlap
<chr> <dbl> <dbl> <dbl> <chr> <int> <int> <int>
1 SNP2 5500 5501 1 Gene2 5000 6000 1
Upvotes: 0