Joining DNA sequences from two files under the same species name

Question

I have two FASTA file with DNA sequences coding for two different proteins. I want to join the sequences for the different proteins and same species into one long sequence.

for example, I have:

Protein 1
>sce
AGTAGATGACAGCT
>act
GCTAGCTAGCT

Protein 2
>sce
GCTACGATCGACT
>act
TACGATCAGCTA

Protein 1+2
>sce
AGTAGATGACAGCTGCTACGATCGACT
>act
GCTAGCTAGCTTACGATCAGCTA

Something that might be a bit of an issue is that the species don't appear in the same order in both files and there's a few sequences that are found in one, but not in the other (files are about 110-species long, with discrepancy of 4 or 5).

My first attempt at writing a code for it was:

gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
outfile = open("joined.fas", 'w')
for line in gamma:
    if line.startswith(">"):
        for line2 in spc:
             if line2.startswith(">"):
                if line == line2:
                    outfile.write(line)
    else:
        outfile.write(line)
fh.close()

but since the DNA sequences are very long and take many lines of the file, I don't know how to select them.

Please help!

Chris Charley · Accepted Answer

By using a dictionary, you could append fasta sequences to each ID. And then, print them to the output file.

outfile = open("joined.fas", 'w')

d = dict()

for file in ('gamma.fas', 'spc98.fas'):
    with open(file, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                key = line
            else:
                d.setdefault(key, '')
                d[key] += line

for key, seq in d.items():
    outfile.write(key + "
" + seq + "
")

outfile.close()

EDIT: By the way, you are opening your two reading files as open for writing which will clobber the two input files.

gamma = open('gamma.fas', 'w') spc = open("spc98.fas", 'w')

They should be opened with r instead of w.

Joining DNA sequences from two files under the same species name

Answers (2)

Related Questions