Tyler Ruddenfort

Reputation: 135

convert multiple SEQ files to fasta format

Is there a way to convert hundreds of SEQ files to FASTA format?

The SEQ files contain only the sequence, as plain text:

ATGCGATCGGACTGACTAGCTACGTACG
ACATCCATCATTATTCTATCTATCTATC
ACTATTCATCTATCTTACTATCTTACTC
AATCATTTCATTA

How can I append the file name of each individual text file as the string ID?

I tried applying code from this thread, like this:

files1 <- list.files(pattern = "*.seq")   
files1 
head(files1) 
for (i in 1:length(files1)) {   
  logFile = read.table(paste0(files1[i]))      
  write.table(rbind(paste0(">",files1[i]),logFile),paste0(files1[i],".fa"),row.names = FALSE,col.names = FALSE,quote = FALSE) 
}

but it did not work; the only output was a +

Upvotes: 0

Views: 802

Answers (1)

Amar

Reputation: 1373

I had to do this for a .seq file generated by Lasergene (from DNASTAR, I think).

The format of the .seq is:

"Contig 2" (1,1412)
  Contig Length:                 1412 bases
  Average Length/Sequence:        757 bases
  Total Sequence Length:         4544 bases
  Top Strand:                       4 sequences
  Bottom Strand:                    2 sequences
  Total:                            6 sequences
FEATURES             Location/Qualifiers
     contig          1..1412
                     /Note="Contig 2(1>1412)"
                     /dnas_scaffold_ID=0
                     /dnas_scaffold_POS=0
     coverage_below  1..568
                     /Note="Below threshold"
     coverage_one    569..749
                     /Note="One_strand"
     coverage_below  750..1331
                     /Note="Below threshold"
     coverage_one    1332..1412
                     /Note="One_strand"

^^
ATGC

The sequence data always followed the ^^ line. So I wrote this simple function to read in the .seq file (plain text) and write out a FASTA file with the file name as the header.

convert_seq_to_fasta = function(path){
  
  # read in file
  lines = readLines(path)
  # find where ^^ is - the sequence data is on the next line
  marker = which(lines == "^^")
  if (length(marker) == 0) stop("no ^^ marker found in ", path)
  start = marker[1] + 1
  
  # get name and create output name
  # (escape the dot and anchor at the end so only a trailing ".seq" is removed)
  file_name = sub("\\.seq$", "", path)
  output = paste0(file_name, ".fasta")
  
  # create fasta header and store fasta body
  fasta_header = paste0(">", file_name)
  fasta_body = lines[start]
  
  # write out the header and the sequence, one per line
  writeLines(c(fasta_header, fasta_body), output)
}

Use it like this:

# pattern is a regular expression, not a glob
seq_files = list.files(pattern = "\\.seq$")

for (seq_file in seq_files) {
  convert_seq_to_fasta(seq_file)
}

This assumes the .seq files are in R's current working directory (check it with getwd()), because list.files() looks there by default.
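If the files live in a different directory, one option is to pass that directory to list.files() and ask for full paths. This is only a sketch: convert_all and its arguments seq_dir and converter are made-up names, and converter stands for any function that takes a single file path.

```r
# Hypothetical helper: run a converter on every .seq file in a directory.
# `converter` is any function that accepts one file path.
convert_all = function(seq_dir, converter) {
  # full.names = TRUE returns paths that include seq_dir,
  # so the working directory does not need to change
  seq_files = list.files(seq_dir, pattern = "\\.seq$", full.names = TRUE)
  for (seq_file in seq_files) {
    converter(seq_file)
  }
  # return the processed paths invisibly for inspection
  invisible(seq_files)
}
```

It would be called as convert_all("path/to/seq_files", convert_seq_to_fasta). Because the converter then receives a full path, whether the FASTA header contains the directory part depends on how the converter builds the header from that path.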

If your .seq files have this format, assuming the file name is rando.seq:

ATGCGATCGGACTGACTAGCTACGTACG
ACATCCATCATTATTCTATCTATCTATC
ACTATTCATCTATCTTACTATCTTACTC
AATCATTTCATTA

And you want this output:

>rando
ATGCGATCGGACTGACTAGCTACGTACGACATCCATCATTATTCTATCTATCTATCACTATTCATCTATCTTACTATCTTACTCAATCATTTCATTA

That is, the header followed by all of the sequence data on a single line. Then you can use this function:

convert_odd_to_fasta = function(path){
  lines = readLines(path)
  # escape the dot and anchor at the end so only a trailing ".seq" is removed
  file_name = sub("\\.seq$", "", path)
  output = paste0(file_name, ".fasta")
  fasta_header = paste0(">", file_name)
  # join all lines of the file into one sequence string
  fasta_body = paste0(lines, collapse = '')
  # write out the header and the sequence, one per line
  writeLines(c(fasta_header, fasta_body), output)
}

Use it the same as above.
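If a folder mixes both layouts, the two cases can be folded into one function. This is a rough sketch, not a tested tool: seq_to_fasta is a made-up name, it assumes file names end in .seq, and it assumes that everything after the first ^^ line (or the whole file when there is no ^^ line) is sequence data.

```r
# Hypothetical helper covering both layouts: with or without a "^^" marker.
seq_to_fasta = function(path) {
  lines = readLines(path)
  marker = which(lines == "^^")
  # With a marker, keep only the lines after it;
  # without one, treat the whole file as sequence
  if (length(marker) > 0) {
    lines = lines[seq_len(length(lines) - marker[1]) + marker[1]]
  }
  # Join the pieces into one string and drop any whitespace
  sequence = gsub("[[:space:]]", "", paste(lines, collapse = ""))
  if (!nzchar(sequence)) stop("no sequence data found in ", path)
  # Header is the file name without directory or extension
  header = paste0(">", sub("\\.seq$", "", basename(path)))
  output = sub("\\.seq$", ".fasta", path)
  writeLines(c(header, sequence), output)
  # Return the output path invisibly
  invisible(output)
}
```

It is used in the same loop as the functions above. The sequence is written on a single line, matching the output shown earlier.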

Hope that helps!

Upvotes: 1
