Reputation: 465
I have files (fasta files with a sequence) that look like this:
File1.fasta
>1
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
AGCCAACATCCTGGCTGTCTGG
>2
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC
GTTAATCGGGGCAGG
File2.fasta
>1
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT
>2
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA
With my script, I count all the 5-mers in these files. My code is as follows:
import operator
import glob
def printSeq(name, seq):
kmers = {}
k = 5
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmer in kmers:
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print (kmer + "\t" + str(count))
sortedKmer = sorted(kmers.items(), reverse=True)
for item in sortedKmer:
print (item[0] + "\t" + str(item[1]))
for name in glob.glob('*.fasta'):
with open(name, 'r') as f:
seq = ""
key = ""
for line in f.readlines():
if line.startswith(">"):
if key and seq:
printSeq(key, seq)
key = line[1:].strip()
seq = ""
else:
seq += line.strip()
printSeq(key, seq)
The output is now the 5-mer followed with the count.
I want to adjust my output so that for each output line I get the filename followed by the header and than the count, like this:
File1 1 GTCTT 1
File1 1 TCTTC 1
File1 1 CTTCC 1
....
File2 2 TTCTA 1
How can I achieve that?
Additional question I want to add the reverse complement sequence of the data and count that together with the previous data. My code to get the reverse complement is as follows
from Bio import SeqIO
for fasta_file in glob.glob('*.fasta'):
for record in SeqIO.parse(fasta_file, "fasta"):
reverse_complement = ">" + record.id + "\n" + record.seq.reverse_complement()
So the "reverse_complement" of file one, header >1 has to be counted together with the previous one etc. How can I include this data to my previous files and count together?
My reverse_complement data is
File1.fasta (reverse_complement)
>1
CCAGACAGCCAGGATGTTGGCTTAGAAGCAGCCATCATTTAAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAAACCATGCACCGAAGCTGCGGCAGCGACGTTTATGCGTTGTTGGGTAGGGGAGCGTTCTGTAAGCCTGCGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACATAAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGAC
>2
CCTGCCCCGATTAACGTTGGACAGGAACCCTTGGTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTTCTGATACCTCCAGCAACCCTCACAGGCCACCTTCACAGGCTTACAGAACGCTCCCCTACCCAACAACACCTGAGTGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTTCCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTT
Upvotes: 0
Views: 1945
Reputation: 46759
This could also be done using a Counter()
as follows:
from collections import Counter
from itertools import groupby
import glob
for fasta_file in glob.glob('*.fasta'):
basename = os.path.splitext(os.path.basename(fasta_file))[0]
with open(fasta_file) as f_fasta:
for k, g in groupby(f_fasta, lambda x: x.startswith('>')):
if k:
sequence = next(g).strip('>\n')
else:
d = list(''.join(line.strip() for line in g))
counts = Counter()
while len(d) >= 5:
five_mer = '{}{}{}{}{}'.format(d[0], d[1], d[2], d[3], d[4])
counts[five_mer] += 1
del d[0]
for five_mer, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
print "{} {} {} {}".format(basename, sequence, five_mer, count)
This would give you output with the largest counts first and then alphabetically:
File1 1 CAGGC 3
File1 1 CGCAG 3
File1 1 GCTTT 3
File1 1 AACGC 2
File1 1 ACATC 2
File1 1 ACGCT 2
File1 1 AGGCC 2
It uses Python's groupby()
function to read groups of lines together. It either reads a single sequence line or a list of five mer lines. k
is the result of the startswith()
call. So when k
is False
, take all the lines returned, remove the newline from each and then join them together to make a single line of characters.
It then reads the first 5 characters from the list, joins them back together and adds them as a key to a Counter()
. It then removes the first character from the list and repeats until there are less than 5 characters remaining.
For just alphabetical ordering:
for five_mer, count in sorted(counts.items()):
A Counter()
works the same way as a dictionary, so .items()
would give a list of key value pairs. These are sorted before being displayed.
Upvotes: 1
Reputation: 13691
You change the signature of
def printSeq(name, seq)
to
def printSeq(file, header, name, seq):
incorporate the new variables in the print
statements.
e.g.
print (item[0] + "\t" + str(item[1]))
v
print (file + "\t" + header + "\t" + item[0] + "\t" + str(item[1]))
Then, in your loop you pass the information to this function.
You have the file name available in the loop, stored in the variable name
You parse the header in the lines where you detect it, and store it in a variable for later use. The later use is when you call the printSeq
-function
Upvotes: 0