Gravel
Gravel

Reputation: 465

Python; print filename and header

I have files (fasta files with a sequence) that look like this:

File1.fasta

>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG

File2.fasta

>1
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
>2
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA

With my script, I count all the 5-mers in these files. My code is as follows:

import operator
import glob

def printSeq(name, seq):
    kmers = {}
    k = 5
    for i in range(len(seq) - k + 1):
       kmer = seq[i:i+k]
       if kmer in kmers:
          kmers[kmer] += 1
       else:
          kmers[kmer] = 1

    for kmer, count in kmers.items():
       print (kmer + "\t" + str(count))

    sortedKmer = sorted(kmers.items(), reverse=True)

    for item in sortedKmer:
       print (item[0] + "\t" + str(item[1]))


for name in glob.glob('*.fasta'):
    with open(name, 'r') as f:
        seq = ""
        key = ""
        for line in f.readlines():
            if line.startswith(">"):
                if key and seq:
                    printSeq(key, seq)
                key = line[1:].strip()
                seq = ""
            else:
                seq += line.strip()
        printSeq(key, seq)

The output is now the 5-mer followed with the count.

I want to adjust my output so that for each output line I get the filename followed by the header and than the count, like this:

File1 1 GTCTT 1
File1 1 TCTTC 1
File1 1 CTTCC 1
....
File2 2 TTCTA 1

How can I achieve that?

Additional question I want to add the reverse complement sequence of the data and count that together with the previous data. My code to get the reverse complement is as follows

from Bio import SeqIO
for fasta_file in glob.glob('*.fasta'):
    for record in SeqIO.parse(fasta_file, "fasta"):
        reverse_complement = ">" + record.id + "\n" + record.seq.reverse_complement()

So the "reverse_complement" of file one, header >1 has to be counted together with the previous one etc. How can I include this data to my previous files and count together?

My reverse_complement data is

File1.fasta (reverse_complement)

>1
CCAGACAGCCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAAACCATGCACCGAAGCTGCGGCAGCGACGTTTATGCGTTGTTGGGTAGGGGAGCGTTCTGTAAGCCTGCGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATAAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGAC
>2
CCTGCCCCGATTAACGTTGGACAGGAACCCTTGGTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTTCTGATACCTCCAGCAACCCTCACAGGCCACCTTCACAGGCTTACAGAACGCTCCCCTACCCAACAACACCTGAGTGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTTCCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTT

Upvotes: 0

Views: 1945

Answers (2)

Martin Evans
Martin Evans

Reputation: 46759

This could also be done using a Counter() as follows:

from collections import Counter
from itertools import groupby
import glob

for fasta_file in glob.glob('*.fasta'):
    basename = os.path.splitext(os.path.basename(fasta_file))[0]

    with open(fasta_file) as f_fasta:
        for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
            if k:
                sequence = next(g).strip('>\n')
            else:
                d = list(''.join(line.strip() for line in g))
                counts = Counter()

                while len(d) >= 5:
                    five_mer = '{}{}{}{}{}'.format(d[0], d[1], d[2], d[3], d[4])
                    counts[five_mer] += 1
                    del d[0]

                for five_mer, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
                    print "{} {} {} {}".format(basename, sequence, five_mer, count)

This would give you output with the largest counts first and then alphabetically:

File1 1 CAGGC 3
File1 1 CGCAG 3
File1 1 GCTTT 3
File1 1 AACGC 2
File1 1 ACATC 2
File1 1 ACGCT 2
File1 1 AGGCC 2

It uses Python's groupby() function to read groups of lines together. It either reads a single sequence line or a list of five mer lines. k is the result of the startswith() call. So when k is False, take all the lines returned, remove the newline from each and then join them together to make a single line of characters.

It then reads the first 5 characters from the list, joins them back together and adds them as a key to a Counter(). It then removes the first character from the list and repeats until there are less than 5 characters remaining.

For just alphabetical ordering:

for five_mer, count in sorted(counts.items()):

A Counter() works the same way as a dictionary, so .items() would give a list of key value pairs. These are sorted before being displayed.

Upvotes: 1

derM
derM

Reputation: 13691

You change the signature of

def printSeq(name, seq)

to

def printSeq(file, header, name, seq):

incorporate the new variables in the print statements.
e.g.

print (item[0] + "\t" + str(item[1]))
v
print (file + "\t" + header + "\t" + item[0] + "\t" + str(item[1]))

Then, in your loop you pass the information to this function.

You have the file name available in the loop, stored in the variable name

You parse the header in the lines where you detect it, and store it in a variable for later use. The later use is when you call the printSeq-function

Upvotes: 0

Related Questions