Python; print filename and header

Question

I have files (fasta files with a sequence) that look like this:

File1.fasta

>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG

File2.fasta

>1
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
>2
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA

With my script, I count all the 5-mers in these files. My code is as follows:

import operator
import glob

def printSeq(name, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
       kmer = seq[i:i+k]
       if kmer in kmers:
          kmers[kmer] += 1
       else:
          kmers[kmer] = 1

    for kmer, count in kmers.items():
       print (kmer + "	" + str(count))

    sortedKmer = sorted(kmers.items(), reverse=True)

    for item in sortedKmer:
       print (item[0] + "	" + str(item[1]))


for name in glob.glob('*.fasta'):
    with open(name, 'r') as f:
        seq = ""
        key = ""
        for line in f.readlines():
            if line.startswith(">"):
                if key and seq:
                    printSeq(key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        printSeq(key, seq)

The output is now the 5-mer followed with the count.

I want to adjust my output so that for each output line I get the filename followed by the header and than the count, like this:

File1 1 GTCTT 1
File1 1 TCTTC 1
File1 1 CTTCC 1
....
File2 2 TTCTA 1

How can I achieve that?

Additional question I want to add the reverse complement sequence of the data and count that together with the previous data. My code to get the reverse complement is as follows

from Bio import SeqIO
for fasta_file in glob.glob('*.fasta'):
    for record in SeqIO.parse(fasta_file, "fasta"):
        reverse_complement = ">" + record.id + "
" + record.seq.reverse_complement()

So the "reverse_complement" of file one, header >1 has to be counted together with the previous one etc. How can I include this data to my previous files and count together?

My reverse_complement data is

File1.fasta (reverse_complement)

>1
CCAGACAGCCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAAACCATGCACCGAAGCTGCGGCAGCGACGTTTATGCGTTGTTGGGTAGGGGAGCGTTCTGTAAGCCTGCGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATAAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGAC
>2
CCTGCCCCGATTAACGTTGGACAGGAACCCTTGGTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTTCTGATACCTCCAGCAACCCTCACAGGCCACCTTCACAGGCTTACAGAACGCTCCCCTACCCAACAACACCTGAGTGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTTCCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTT

Martin Evans · Accepted Answer

This could also be done using a Counter() as follows:

from collections import Counter
from itertools import groupby
import glob

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]

    with open(fasta_file) as f_fasta:
        for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>
')
            else:
                d = list(''.join(line.strip() for line in g))
                counts = Counter()

                while len(d) >= 5:
                    five_mer = '{}{}{}{}{}'.format(d[0], d[1], d[2], d[3], d[4])
                    counts[five_mer] += 1
                    del d[0]

                for five_mer, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
                    print "{} {} {} {}".format(basename, sequence, five_mer, count)

This would give you output with the largest counts first and then alphabetically:

File1 1 CAGGC 3
File1 1 CGCAG 3
File1 1 GCTTT 3
File1 1 AACGC 2
File1 1 ACATC 2
File1 1 ACGCT 2
File1 1 AGGCC 2

It uses Python's groupby() function to read groups of lines together. It either reads a single sequence line or a list of five mer lines. k is the result of the startswith() call. So when k is False, take all the lines returned, remove the newline from each and then join them together to make a single line of characters.

It then reads the first 5 characters from the list, joins them back together and adds them as a key to a Counter(). It then removes the first character from the list and repeats until there are less than 5 characters remaining.

For just alphabetical ordering:

for five_mer, count in sorted(counts.items()):

A Counter() works the same way as a dictionary, so .items() would give a list of key value pairs. These are sorted before being displayed.

Python; print filename and header

Answers (2)

Related Questions