Revan

Reputation: 2322

How to convert multi-line FASTA files to single-line FASTA files without Biopython

I have several large FASTA files in which each sequence is wrapped across multiple lines.

>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTA
TGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGT
ACGTAGTGCAG
...

I want to transform these into FASTA files in which each sequence is combined into a single line.

>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTATGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGTACGTAGTGCAG
...

My FASTA files are larger than my available memory, so I need a memory-efficient method, and I therefore cannot use Biopython. In case it is helpful for anyone, a Biopython solution to this problem (taken from Biostars) is shown below.

import re

from Bio import SeqIO

def multi2linefasta(indir, outdir, filelist):
    for items in filelist:
        # Drop everything from the first dot onwards to build the output name
        mfasta = outdir + "/" + re.sub(r'\..*', '', items) + '_twoline.fasta'
        # The 'rU' mode was removed in Python 3.11; the default text mode is enough
        with open(indir + '/' + items) as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                ofile.write('>' + record.id + '\n' + sequence + '\n')

Upvotes: 3

Views: 4402

Answers (2)

Imen Ayadi

Reputation: 11

With the following answer, every ID and sequence ends up on the same line. I would like the header on one line and all of the lines of its sequence joined on the next line.

with open('input.fasta') as f_input, open('output.fasta', 'w') as f_output:
    block = []

    for line in f_input:
        if line.startswith('>header'):
            if block:
                f_output.write(''.join(block) + '\n')
                block = []
            f_output.write(line)
        else:
            block.append(line.strip())

    if block:
        f_output.write(''.join(block) + '\n')
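
One possible explanation, assuming the headers in the real file do not begin with the literal text '>header' (for example '>seq1'): the startswith('>header') test never matches, so every line is treated as sequence and joined into one line. The sketch below reproduces this on a small in-memory sample. The function name unwrap_fasta, the header_prefix parameter, and the sample records are made up for illustration.

```python
def unwrap_fasta(lines, header_prefix):
    """Join wrapped sequence lines; a line starting with header_prefix is a header."""
    out = []
    block = []
    for line in lines:
        if line.startswith(header_prefix):
            if block:
                out.append(''.join(block))
                block = []
            out.append(line.strip())
        else:
            block.append(line.strip())
    if block:
        out.append(''.join(block))
    return out

# Hypothetical records whose headers do not start with '>header'
sample = ['>seq1\n', 'AGTC\n', 'GGTA\n', '>seq2\n', 'TTGA\n']

# No header is recognised, so everything collapses into a single line
print(unwrap_fasta(sample, '>header'))  # ['>seq1AGTCGGTA>seq2TTGA']

# Testing for '>' alone treats every FASTA header as a header
print(unwrap_fasta(sample, '>'))  # ['>seq1', 'AGTCGGTA', '>seq2', 'TTGA']
```

If this is the cause, changing the test to line.startswith('>') should give the header on one line and the joined sequence on the next.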

Upvotes: 1

Martin Evans

Reputation: 46759

The following will process your file a line at a time:

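with open('input.fasta') as f_input, open('output.fasta', 'w') as f_output:
    block = []  # sequence lines of the current record only

    for line in f_input:
        # Any line starting with '>' is a FASTA header, whatever text follows it
        if line.startswith('>'):
            if block:
                # Write the previous record's sequence as a single line
                f_output.write(''.join(block) + '\n')
                block = []
            f_output.write(line)
        else:
            block.append(line.strip())

    # Write the sequence of the last record
    if block:
        f_output.write(''.join(block) + '\n')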

Giving you an output.fasta containing:

>header1
AGTCGTAGCTACGTACGTACGTGTACGTACGTATGACGTACGTAGCTGCATGCTA
>header2
TGCAGATCGTAGTCGATGCTAGTGCATGCATGTACGTAGTGCAG
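
The approach above keeps the lines of one record in a list until the next header is reached. If a single sequence could itself be larger than memory, a variant that writes each piece as soon as it is read may help. This is a minimal sketch rather than a tested tool: the function name unwrap_fasta_stream and its two path parameters are assumptions, and it treats any line starting with '>' as a header.

```python
def unwrap_fasta_stream(in_path, out_path):
    """Copy a FASTA file, writing each record's sequence on one line.

    Sequence lines are written immediately without their line break,
    so only one input line is held in memory at a time.
    """
    with open(in_path) as f_input, open(out_path, 'w') as f_output:
        in_sequence = False  # True while the current output line holds sequence data

        for line in f_input:
            line = line.strip()
            if line.startswith('>'):
                if in_sequence:
                    # Terminate the previous record's sequence line
                    f_output.write('\n')
                    in_sequence = False
                f_output.write(line + '\n')
            elif line:
                # Append this piece to the current sequence line
                f_output.write(line)
                in_sequence = True

        # Terminate the last record's sequence line
        if in_sequence:
            f_output.write('\n')
```

It could be called as unwrap_fasta_stream('input.fasta', 'output.fasta'), using the same file names as above. Blank lines in the input are skipped.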

Upvotes: 4
