krista
krista

Reputation: 11

Is there a way to group polymorphisms between genes into both gene categories using R and the ifelse function?

I have two datasets. One is a list of genes that shows the nucleotide positions of each gene.

For example:

    Gene Name  Low Position   Upper Position
    Gene 1       1000           2000
    Gene 2       5000           6000

The other dataset is a list of polymorphisms and their nucleotide positions

For example:

           Position  Gene Location
    SNP 1   3000       NA
    SNP 2   5500       NA

I have used the ifelse function in R to sort my dataset of polymorphisms into their respective genes (so SNP 2 would have Gene 2 in its "Gene Location" column). The code I used was:

   SampleGeneData$Gene.Name=as.character(SampleGeneData$Gene.Name)
   SampleSNPData$Gene.Location=ifelse(sapply(SampleSNPData$Position,function(p) any(SampleGeneData$Low.Position<=p&SampleGeneData$High.Position>=p)),SampleGeneData$Gene.Name,"NO")

I was wondering if it was possible to also define the Gene Location as both of the genes that the SNP is located in between (so SNP 1 would have some output of "Gene 1 and Gene 2" or something similar). Could I do this with the ifelse function or would I have to use something else?

Upvotes: 0

Views: 32

Answers (1)

Roman
Roman

Reputation: 17678

you can use the package valr for that

library(tidyverse)
library(valr)

# data needs columns: chrom, start & end
genes <- read.table(text=" Gene  start   end
    Gene1       1000           2000
    Gene2       5000           6000", header=T) %>% 
  mutate(chrom =1)
snps <- tibble(rs=c("SNP1", "SNP2") ,start=c(3000,5500), end = c(3000,5500)+1,chrom=1)      

Identify closest intervals:

valr::bed_closest(snps, genes, overlap=T) 
# A tibble: 3 x 9
  rs.x  start.x end.x chrom Gene.y start.y end.y .overlap .dist
 <chr>    <dbl> <dbl> <dbl> <chr>    <int> <int>    <int> <int>
1 SNP1     3000  3001     1 Gene1     1000  2000        0 -1001
2 SNP2     5500  5501     1 Gene2     5000  6000        1     0
3 SNP2     5500  5501     1 Gene1     1000  2000        0 -3501

Then you can transform the data according your needs such as:

valr::bed_closest(snps, genes, overlap=T) %>% 
  group_by(rs=rs.x) %>% 
  # add a filter of maximal distance if needed
  # filter(abs(.dist) <= 5000) %>%
  summarise(Genes = toString(Gene.y)) %>% 
  right_join(snps,by = "rs")
# A tibble: 2 x 5
    rs  Genes        start   end chrom
 <chr>  <chr>        <dbl> <dbl> <dbl>
1 SNP1  Gene1         3000  3001     1
2 SNP2  Gene2, Gene1  5500  5501     1

Only intersecting intervals via:

valr::bed_intersect(snps, genes) 
# A tibble: 1 x 8
rs.x  start.x end.x chrom Gene.y start.y end.y .overlap
<chr>   <dbl> <dbl> <dbl> <chr>    <int> <int>    <int>
  1 SNP2     5500  5501     1 Gene2     5000  6000        1

Upvotes: 0

Related Questions