Diesel
Diesel

Reputation: 43

Extract sequence fragments from FASTA file using coordinates on a GRanges object

I need to extract the intergenic sequences of Bacillus Subtilis.

I have the full DNA sequence of B.subtilis in R, imported with seqinr. I imported the dna sequence into R using the function read.fasta of the package seqinr;

I then created a GRanges object from Bacillus subtilis genbank file using the package "genbankr" to extract intergenic coordinates.

This is the format of my GRanges Object with intergenic coordinates:

GRanges object with 3841 ranges and 1 metadata column:
  seqnames             ranges strand |             intergenic_id
     <Rle>          <IRanges>  <Rle> |               <character>
  168      168       [   1,  409]      * | intergenic_SEQ-BEGIN_dnaA
  168      168       [1751, 1938]      * |      intergenic_dnaA_dnaN
  168      168       [3076, 3205]      * |      intergenic_dnaN_yaaA
  168      168       [3422, 3436]      * |      intergenic_yaaA_recF
  168      168       [4550, 4566]      * |      intergenic_recF_yaaB
  ...      ...                ...    ... .                       ...

Therefore in R I have: "x" (the full DNA sequence imported with seqinr) and "intergenic" (the GRanges Object with coordinates)

I have seen people asking similar question in this forum, I have tried to follow those answers unsuccessfully by using a variety of packages but I can't figure it out. I am hopeful there is a easy solution for this; any help is appreciated.

My desired output would be to produce a fasta file with the intergenic sequences in the following format:

>intergenic_SEQ-BEGIN_dnaA
ATATATATATTATTTATTTTTTTTTTTTATTATAT
>intergenic_dnaA_dnaN
ATATATCGCGTCGATCTAGACTCAGGCATG
etc.

Basically a line with the name of the intergenic sequence taken from the intergenic_id name column of my GRanges object followed by the sequence extracted from the fasta using the coordinates in the GRanges.

NOTE: in the desired output i just typed some random sequences, just as example.

Upvotes: 2

Views: 3156

Answers (1)

Osdorp
Osdorp

Reputation: 320

I also think BSGenome it is the best way (you can build your BSgenome), but you can also do it with seqinr creating your own function:

require('seqinr')
require('GenomicRanges')

# Create objects
mygenome <- read.fasta('sequence.fasta')[[1]] # I assume it is just one chromosome
mygrs <- GRanges(seqnames=rep('NC_000964',3),
                 ranges=IRanges(c(1,50,100),c(20,55,103)),
                  strand=c('*','*','*'))
mcols(mygrs)$Gene <- c('GenA','GenB','GenC')
mygrs
# GRanges object with 3 ranges and 1 metadata column:
#   seqnames     ranges strand |        Gene
# <Rle>  <IRanges>  <Rle> | <character>
#   [1] NC_000964 [  1,  20]      * |        GenA
# [2] NC_000964 [ 50,  55]      * |        GenB

# Function to subset the seqinr list
 extractSeq <- function(x) {
    if (as.character(strand(x)) == '-') {
      comp(mygenome[end(x):start(x)])
    } else {
      mygenome[start(x):end(x)]
    }
  }    

# Ex
  extractSeq(mygrs[1])
  #  [1] "a" "t" "c" "t" "t" "t" "t" "t" "c" "g" "g" "c" "t" "t" "t" "t" "t" "t" "t" "a"    

# Apply to all
  myseqs <- lapply(mygrs, extractSeq)

# write to a file
  write.fasta(myseqs, mcols(mygrs)$Gene, 'myfile.fa')

Upvotes: 1

Related Questions