JTrew
JTrew

Reputation: 23

Drawing multiple sequences from 1 file, based on shared fields in another file

I'm trying to run a python script to draw sequences from a separate file (merged.fas), in respect to a list (gene_fams_eggnog.txt) I have as output from another program.

The code is as follows:

from Bio import SeqIO
import os, sys, re
from collections import defaultdict
sequences = "merged.fas"
all_seqs = SeqIO.index(sequences, "fasta")
gene_fams = defaultdict(list)

gene_fams_file = open("gene_fams_eggnog.txt")
for line in gene_fams_file:
     fields = re.split("\t", line.rstrip())
     gene_fams[fields[0]].append[fields[1]]

for fam in gene_fams.keys():
    output_filename = str(fam) + ".fasta"
    outh = open(output_filename, "w")
    for id in gene_fams[fam]:
        if id in all_seqs:
            outh.write(">" + all_seqs[id].description + "\n" + str(all_seqs[id].seq) + "\n")
        else:
            print "Uh oh! Sequence with ID " + str(id) + " is not in the all_seqs file!"
            quit()
    outh.close()

The list looks like this:

1 Saccharomycescerevisiae_DAA09367.1
1 bieneu_EED42827.1
1 Asp_XP_749186.1
1 Mag_XP_003717339.1
2 Mag_XP_003716586.1
2 Mag_XP_003709453.1
3 Asp_XP_749329.1

The field 0 denotes a grouping based by a similarity between the sequences. The script was meant to take all the sequences from merged.fas that correspond to the code in the field 1 and write them into a file base on field 0.

So in the case of the portion of the list I have shown, all the sequences that have a 1 in field 0 (Saccharomycescerevisiae_DAA09367.1, bieneu_EED42827.1, Asp_XP_749186.1, Mag_XP_003717339.1) would have been written into a file called 1.fasta. This should continue from 2.fasta-however many groups there are.

So this has worked, however it doesn't include all the sequences that are in the group, it'll only include the last one to be listed as a part of that group. Using my example above, I'd only have a file (1.fasta) with one sequence (Mag_XP_003717339.1), instead of all four.

Any and all help is appreciated, Thanks, JT

Upvotes: 0

Views: 97

Answers (1)

cdlane
cdlane

Reputation: 41872

Although I didn't spot the cause of the issue you complained about, I'm surprised your code runs at all with this error:

gene_fams[fields[0]].append[fields[1]]

i.e. append[...] instead of append(...). But perhaps that's also, "not there in the actual script I'm running". I rewrote your script below, and it works fine for me. One issue was your use of the variable name id which is a Python builtin. You'll see I go to an extreme to avoid such errors:

from Bio import SeqIO
from collections import defaultdict

SEQUENCE_FILE_NAME = "merged.fas"
FAMILY_FILE_NAME = "gene_families_eggnog.txt"

all_sequences = SeqIO.index(SEQUENCE_FILE_NAME, "fasta")
gene_families = defaultdict(list)

with open(FAMILY_FILE_NAME) as gene_families_file:
    for line in gene_families_file:
        family_id, gene_id = line.rstrip().split()
        gene_families[family_id].append(gene_id)

for family_id, gene_ids in gene_families.items():
    output_filename = family_id + ".fasta"

    with open(output_filename, "w") as output:
        for gene_id in gene_ids:
            assert gene_id in all_sequences, "Sequence {} is not in {}!".format(gene_id, SEQUENCE_FILE_NAME)

            output.write(all_sequences[gene_id].format("fasta"))

Upvotes: 1

Related Questions