Reputation: 135
Is there a way to convert hundreds of SEQ files to FASTA format?
The SEQ files contain only the sequence, as plain text:
ATGCGATCGGACTGACTAGCTACGTACG
ACATCCATCATTATTCTATCTATCTATC
ACTATTCATCTATCTTACTATCTTACTC
AATCATTTCATTA
How can I append the file name of each individual text file as the string ID?
I tried applying code from this thread, like this:
files1 <- list.files(pattern = "*.seq")
files1
head(files1)
for (i in 1:length(files1)) {
  logFile = read.table(paste0(files1[i]))
  write.table(rbind(paste0(">",files1[i]),logFile),paste0(files1[i],".fa"),row.names = FALSE,col.names = FALSE,quote = FALSE)
}
But it did not work; the only output was a +.
Upvotes: 0
Views: 802
Reputation: 1373
I had to do this for a .seq file, which is generated by Lasergene DNA (or DNAStar, I think).
The format of the .seq file is:
"Contig 2" (1,1412)
Contig Length: 1412 bases
Average Length/Sequence: 757 bases
Total Sequence Length: 4544 bases
Top Strand: 4 sequences
Bottom Strand: 2 sequences
Total: 6 sequences
FEATURES Location/Qualifiers
contig 1..1412
/Note="Contig 2(1>1412)"
/dnas_scaffold_ID=0
/dnas_scaffold_POS=0
coverage_below 1..568
/Note="Below threshold"
coverage_one 569..749
/Note="One_strand"
coverage_below 750..1331
/Note="Below threshold"
coverage_one 1332..1412
/Note="One_strand"
^^
ATGC
The sequence data always follows the ^^ line. So I wrote this simple function to read in the .seq file (plain text) and write out a FASTA file with the file name as the header.
convert_seq_to_fasta = function(path){
  # read in file
  lines = readLines(path)
  # find the "^^" marker - the sequence data is everything after it
  marker = which(lines == "^^")[1]
  if (is.na(marker)) stop("no ^^ marker found in ", path)
  # strip the .seq extension (escaped dot, anchored at the end) and create output name
  file_name = sub("\\.seq$", "", path)
  output = paste0(file_name, ".fasta")
  # create fasta header and store fasta body
  fasta_header = paste0(">", file_name)
  fasta_body = lines[-seq_len(marker)]
  # write out: writeLines puts the header and the sequence on separate lines
  writeLines(c(fasta_header, fasta_body), output)
}
Use it like this:
# pattern is a regular expression, so escape the dot and anchor the end
seq_files = list.files(pattern = "\\.seq$")
for (seq_file in seq_files) {
  convert_seq_to_fasta(seq_file)
}
This assumes the .seq files are in R's working directory, for example the same directory as the script (so save it first).
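If the files live in some other folder, a variation along these lines might work. This is only a minimal sketch: convert_dir_to_fasta is a made-up name, and it assumes the same layout as above, where the sequence comes after a line containing only ^^. Files without that marker are skipped with a warning.

```r
# Hypothetical helper: convert every .seq file in `dir`, which does not
# have to be the working directory. Assumes the sequence follows a "^^" line.
convert_dir_to_fasta = function(dir = ".") {
  # full.names = TRUE returns paths that include `dir`
  seq_files = list.files(dir, pattern = "\\.seq$", full.names = TRUE)
  for (path in seq_files) {
    lines = readLines(path)
    marker = which(lines == "^^")[1]
    if (is.na(marker)) {
      warning("no ^^ marker in ", path, "; skipped")
      next
    }
    # the header uses the bare file name, without directory or extension
    id = sub("\\.seq$", "", basename(path))
    output = file.path(dir, paste0(id, ".fasta"))
    # everything after the marker line is treated as sequence data
    writeLines(c(paste0(">", id), lines[-seq_len(marker)]), output)
  }
  invisible(seq_files)
}
```

Call it with the folder that holds the files, e.g. convert_dir_to_fasta("path/to/seq_files") (a placeholder path). The .fasta files are written next to the .seq files.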
If your .seq files have this format, assuming the file name is rando.seq:
ATGCGATCGGACTGACTAGCTACGTACG
ACATCCATCATTATTCTATCTATCTATC
ACTATTCATCTATCTTACTATCTTACTC
AATCATTTCATTA
And you want this output:
>rando
ATGCGATCGGACTGACTAGCTACGTACGACATCCATCATTATTCTATCTATCTATCACTATTCATCTATCTTACTATCTTACTCAATCATTTCATTA
That is, the header followed by all the sequence data on a single line. Then you can use this function:
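convert_odd_to_fasta = function(path){
  # read in file - every line is sequence data
  lines = readLines(path)
  # strip the .seq extension and create output name
  file_name = sub("\\.seq$", "", path)
  output = paste0(file_name, ".fasta")
  # create fasta header and join the sequence lines into one string
  fasta_header = paste0(">", file_name)
  fasta_body = paste0(lines, collapse = '')
  # write out: header on the first line, sequence on the second
  writeLines(c(fasta_header, fasta_body), output)
}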
Use it the same as above.
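If you want to check the joining step before running it on hundreds of files, something like the following sketch could help. It recreates the rando.seq example in a temporary directory, so nothing in your own folder is touched. The variable names are just for illustration.

```r
# Sketch: reproduce the rando.seq example in a temporary directory.
tmp_dir = tempfile("seq_demo")
dir.create(tmp_dir)
seq_path = file.path(tmp_dir, "rando.seq")
writeLines(c("ATGCGATCGGACTGACTAGCTACGTACG",
             "ACATCCATCATTATTCTATCTATCTATC",
             "ACTATTCATCTATCTTACTATCTTACTC",
             "AATCATTTCATTA"), seq_path)

# join the wrapped lines into one sequence string
lines = readLines(seq_path)
fasta_body = paste0(lines, collapse = "")

# the header is the file name without directory or extension
id = sub("\\.seq$", "", basename(seq_path))
fasta_path = file.path(tmp_dir, paste0(id, ".fasta"))
writeLines(c(paste0(">", id), fasta_body), fasta_path)

# read the result back: one header line, one sequence line
result = readLines(fasta_path)
print(result[1])
# TRUE if no bases were lost when joining the lines
print(nchar(result[2]) == sum(nchar(lines)))
```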
Hope that helps!
Upvotes: 1