How to convert multiline fasta files to singleline fasta files without biopython

Question

I have several large fasta files where the sequence is kept in multiple lines.

>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTA
TGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGT
ACGTAGTGCAG
...

And I want to transform this into fasta files where the sequences are combined into one line.

>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTATGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGTACGTAGTGCAG
...

My fasta files are larger than my available memory, so I need a memory-efficient method. For that reason I cannot use the Biopython approach. I include the Biopython solution I found on Biostars below in case it is helpful for anyone.

from Bio import SeqIO
import re

def multi2linefasta(indir, outdir, filelist):
    for items in filelist:
        # Output name: input name up to the first dot, plus '_twoline.fasta'
        mfasta = outdir + '/' + re.sub(r'\..*', '', items) + '_twoline.fasta'
        # The old 'rU' mode was removed in Python 3.11; plain text mode is enough
        with open(indir + '/' + items) as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                ofile.write('>' + record.id + '\n' + sequence + '\n')

Martin Evans · Accepted Answer

The following will process your file a line at a time:

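with open('input.fasta') as f_input, open('output.fasta', 'w') as f_output:
    block = []  # sequence lines of the record currently being read

    for line in f_input:
        # Any line starting with '>' is a header, whatever the name after it
        if line.startswith('>'):
            if block:
                # Write the previous record's sequence as a single line
                f_output.write(''.join(block) + '\n')
                block = []
            f_output.write(line)
        else:
            block.append(line.strip())

    # Write the sequence of the last record
    if block:
        f_output.write(''.join(block) + '\n')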

Giving you an output.fasta containing:

>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTATGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGTACGTAGTGCAG
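
The code above still collects all the lines of one record in `block` before writing them. If a single sequence could itself be too large for memory, one possible variant is to write each sequence line as soon as it is read and only add the newline when the next header (or the end of the file) is reached. The following is a minimal sketch of that idea, not part of the original answer; the function name `unwrap_fasta` is hypothetical, and it assumes that every header line starts with `>`.

```python
def unwrap_fasta(src, dst):
    """Copy FASTA records from src to dst, joining each sequence onto one line.

    src and dst are open text file objects (hypothetical helper, for illustration).
    Only one input line is held in memory at a time.
    """
    in_sequence = False  # True once sequence text was written for the current record

    for line in src:
        line = line.strip()
        if line.startswith('>'):
            if in_sequence:
                # Terminate the sequence line of the previous record
                dst.write('\n')
                in_sequence = False
            dst.write(line + '\n')
        elif line:
            # Write the sequence fragment without a trailing newline
            dst.write(line)
            in_sequence = True

    # Terminate the sequence line of the last record
    if in_sequence:
        dst.write('\n')


# Example usage, assuming input.fasta exists in the working directory:
# with open('input.fasta') as f_input, open('output.fasta', 'w') as f_output:
#     unwrap_fasta(f_input, f_output)
```

Blank lines in the input are skipped, and a header with no sequence after it is copied through unchanged.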
