Reputation: 2322
I have several large fasta files where the sequence is kept in multiple lines.
>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTA
TGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGT
ACGTAGTGCAG
...
And I want to transform this into fasta files where the sequences are combined into one line.
>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTATGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGTACGTAGTGCAG
...
My FASTA files are very large (larger than my available memory), so I need a memory-efficient method. Therefore I cannot use Biopython. In case it is helpful for anyone, below is a Biopython solution to my problem, taken from Biostars.
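import re
from Bio import SeqIO

def multi2linefasta(indir, outdir, filelist):
    for items in filelist:
        # Output name: input name up to the first dot, plus '_twoline.fasta'
        mfasta = outdir + "/" + re.sub(r'\..*', '', items) + '_twoline.fasta'
        # Plain text mode: the 'U' flag was removed in Python 3.11
        with open(indir + '/' + items) as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                # Note: only record.id (the first word of the header) is kept
                ofile.write('>' + record.id + '\n' + sequence + '\n')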
Upvotes: 3
Views: 4402
Reputation: 11
Using the following answer, all the IDs and sequences end up on the same line. I would like to get each header on its own line and all the lines of its sequence joined on the next line.
with open('input.fasta') as f_input, open('output.fasta', 'w') as f_output:
    block = []
    for line in f_input:
        if line.startswith('>header'):
            if block:
                f_output.write(''.join(block) + '\n')
                block = []
            f_output.write(line)
        else:
            block.append(line.strip())
    if block:
        f_output.write(''.join(block) + '\n')
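One possible explanation, assuming the real headers do not begin with the literal text `>header`: the test `line.startswith('>header')` never matches, so every line, headers included, is appended to `block` and written out as one line. The sketch below reproduces this on a made-up sample; the function name `join_sequences`, its `header_prefix` parameter, and the sample records are hypothetical and only serve the illustration.

```python
def join_sequences(lines, header_prefix):
    """Yield each header unchanged, followed by its sequence joined into one line."""
    block = []
    for line in lines:
        if line.startswith(header_prefix):
            if block:
                yield ''.join(block)
                block = []
            yield line.strip()
        else:
            block.append(line.strip())
    if block:
        yield ''.join(block)


# Hypothetical records whose headers do not start with '>header'
sample = ['>seq1 sample A\n', 'AGTC\n', 'GTAG\n', '>seq2 sample B\n', 'TGCA\n']

# No line matches '>header', so headers and sequences are glued together
merged = list(join_sequences(sample, '>header'))

# Every FASTA header starts with '>', so matching on that alone separates them
fixed = list(join_sequences(sample, '>'))
```

If this is the cause, matching on `'>'` alone should give one header line followed by one sequence line per record.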
Upvotes: 1
Reputation: 46759
The following will process your file a line at a time:
with open('input.fasta') as f_input, open('output.fasta', 'w') as f_output:
    block = []  # sequence fragments of the current record
    for line in f_input:
        # Every FASTA header starts with '>', whatever text follows it
        if line.startswith('>'):
            if block:
                f_output.write(''.join(block) + '\n')
                block = []
            f_output.write(line)
        else:
            block.append(line.strip())
    # Flush the sequence of the last record
    if block:
        f_output.write(''.join(block) + '\n')
Giving you an output.fasta containing:
>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTATGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGTACGTAGTGCAG
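The loop above still keeps one complete sequence in `block` at a time. If a single sequence could itself be too large for memory, a variant that writes each fragment as soon as it is read may help. This is only a sketch: the function name `unwrap_fasta` and the `in_sequence` flag are my own, and it assumes that every header line starts with `>` and that blank lines can be dropped.

```python
def unwrap_fasta(in_path, out_path):
    """Rewrite a FASTA file so each record's sequence sits on a single line.

    Fragments are written as soon as they are read, so nothing longer than
    one input line is held in memory by this function.
    """
    with open(in_path) as f_input, open(out_path, 'w') as f_output:
        in_sequence = False  # True while the current output line holds sequence data
        for line in f_input:
            if line.startswith('>'):
                if in_sequence:
                    f_output.write('\n')  # terminate the previous sequence line
                    in_sequence = False
                f_output.write(line.rstrip('\r\n') + '\n')
            else:
                fragment = line.strip()
                if fragment:  # skip blank lines
                    f_output.write(fragment)
                    in_sequence = True
        if in_sequence:
            f_output.write('\n')  # terminate the last sequence line
```

It would be called as `unwrap_fasta('input.fasta', 'output.fasta')`. The trade-off is one `write` call per input line instead of one per record, which Python's buffered file objects should absorb without much cost.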
Upvotes: 4