Reputation: 43
I need to extract the intergenic sequences of Bacillus Subtilis.
I have the full DNA sequence of B.subtilis in R, imported with seqinr. I imported the dna sequence into R using the function read.fasta of the package seqinr;
I then created a GRanges object from Bacillus subtilis genbank file using the package "genbankr" to extract intergenic coordinates.
This is the format of my GRanges Object with intergenic coordinates:
GRanges object with 3841 ranges and 1 metadata column:
seqnames ranges strand | intergenic_id
<Rle> <IRanges> <Rle> | <character>
168 168 [ 1, 409] * | intergenic_SEQ-BEGIN_dnaA
168 168 [1751, 1938] * | intergenic_dnaA_dnaN
168 168 [3076, 3205] * | intergenic_dnaN_yaaA
168 168 [3422, 3436] * | intergenic_yaaA_recF
168 168 [4550, 4566] * | intergenic_recF_yaaB
... ... ... ... . ...
Therefore in R I have: "x" (the full DNA sequence imported with seqinr) and "intergenic" (the GRanges Object with coordinates)
I have seen people asking similar question in this forum, I have tried to follow those answers unsuccessfully by using a variety of packages but I can't figure it out. I am hopeful there is a easy solution for this; any help is appreciated.
My desired output would be to produce a fasta file with the intergenic sequences in the following format:
>intergenic_SEQ-BEGIN_dnaA
ATATATATATTATTTATTTTTTTTTTTTATTATAT
>intergenic_dnaA_dnaN
ATATATCGCGTCGATCTAGACTCAGGCATG
etc.
Basically a line with the name of the intergenic sequence taken from the intergenic_id name column of my GRanges object followed by the sequence extracted from the fasta using the coordinates in the GRanges.
NOTE: in the desired output i just typed some random sequences, just as example.
Upvotes: 2
Views: 3156
Reputation: 320
I also think BSGenome it is the best way (you can build your BSgenome), but you can also do it with seqinr creating your own function:
require('seqinr')
require('GenomicRanges')
# Create objects
mygenome <- read.fasta('sequence.fasta')[[1]] # I assume it is just one chromosome
mygrs <- GRanges(seqnames=rep('NC_000964',3),
ranges=IRanges(c(1,50,100),c(20,55,103)),
strand=c('*','*','*'))
mcols(mygrs)$Gene <- c('GenA','GenB','GenC')
mygrs
# GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | Gene
# <Rle> <IRanges> <Rle> | <character>
# [1] NC_000964 [ 1, 20] * | GenA
# [2] NC_000964 [ 50, 55] * | GenB
# Function to subset the seqinr list
extractSeq <- function(x) {
if (as.character(strand(x)) == '-') {
comp(mygenome[end(x):start(x)])
} else {
mygenome[start(x):end(x)]
}
}
# Ex
extractSeq(mygrs[1])
# [1] "a" "t" "c" "t" "t" "t" "t" "t" "c" "g" "g" "c" "t" "t" "t" "t" "t" "t" "t" "a"
# Apply to all
myseqs <- lapply(mygrs, extractSeq)
# write to a file
write.fasta(myseqs, mcols(mygrs)$Gene, 'myfile.fa')
Upvotes: 1