random
random

Reputation: 194

Finding the substring with the most repeats in a dictionary with dna sequences

The substring has to be with 6 characters. The number I'm gettig is smaller than it should be.

first I've written code to get the sequences from a file, then put them in a dictionary, then written 3 nested for loops: the first iterates over the dictionary and gets a sequence in each iteration. The second takes each sequence and gets a substring with 6 characters from it. In each iteration, the second loop increases the index of the start of the string (the long sequence) by 1. The third loop takes each substring from the second loop, and counts how many times it appears in each string (long sequence).

I tried rewriting the code many times. I think I got very close. I checked if the loops actually do their iterations, and they do. I even checked manually to see if the counts for a substring in random sequences are the same as the program gives, and they are. Any idea? maybe a different approach? what debugger do you use for Python?

I added a file with 3 shortened sequences for testing. Maybe try smaller substring: say with 3 characters instead of 6: rep_len = 3

The code

matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos  = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")
# inserting each sequences from the file into a dictionary
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        seq_count += 1
        name = seq_count
        seqs[name] = ''
    else:
        seqs[name] += line
for key, seq in seqs.items():  # getting one sequence in each iteration
    for pos in range(len(seq)):  # setting an index and increasing it by 1 in each iteration
        if pos <= len(seq) - rep_len: # checking no substring from the end of the sequence are selected
            repeat = seq[pos:pos + rep_len] # setting a substring
            if repeat not in matches: # checking if the substring was already scanned
                matches.append(repeat) # adding the substring to previously checked substrings' list
                for key1, seq2 in seqs.items(): # iterating over each sequence
                    count += seq2.count(repeat) # counting the substring's repetitions
                if count > final_count: # if the count is greater than the previously saved greatest number
                    final_count = count # the new value is saved
                count = 0    
print('repetitions: ', final_count) # printing

sequences.fasta

Upvotes: 0

Views: 819

Answers (2)

bug_spray
bug_spray

Reputation: 1516

It might be easier to use Counter from the collections module in Python. Also check out the NLTK library.

An example:

from collections import Counter
from nltk.util import ngrams

sequence = "cggttgcaatgagcgtcttgcacggaccgtcatgtaagaccgctacgcttcgatcaacgctattacgcaagccaccgaatgcccggctcgtcccaacctg"

def reps(substr): 
    "Counts repeats in a substring"
    return sum([i for i in Counter(substr).values() if i>1])


def make_grams(sent, n=6):
    "splits a sentence into n-grams"
    return ["".join(seq) for seq in (ngrams(sent,n))]

grams = make_grams(sequence) # splits string into substrings
max_length = max(list(map(reps, grams))) # gets maximum repeat count

result = [dna for dna in grams if reps(dna) == max_length]
print(result)

Output: ['gcgtct', 'cacgga', 'acggac', 'tgtaag', 'agaccg', 'gcttcg', 'cgcaag', 'gcaagc', 'gcccgg', 'cccggc', 'gctcgt', 'cccaac', 'ccaacc']

And if the question is look for the string with the most repeated character:

repeat_count = [max(Counter(a).values()) for a in result] # highest character repeat count
result_dict = {dna:ct for (dna,ct) in zip(result, repeat_count)}

another_result = [dna for dna in result_dict.keys() if result_dict[dna] == max(repeat_count)]
print(another_result)

Output: ['cccggc', 'cccaac', 'ccaacc']

Upvotes: 1

Tawy
Tawy

Reputation: 629

The code is not very clear, so it is a bit difficult to debug. I suggest rewriting.

Anyway, I (currently) just noted one small mistake:

        if pos < len(seq) - rep_len:

Should be

        if pos <= len(seq) - rep_len:

Currently, the last character in each sequence is ignored.

EDIT:

Here some rewriting of your code that is clearer and might help you investigate the errors:

rep_len = 6
seq_count = 0
seqs = {}
filename = "dna2.txt"

# Extract the data into a dictionary
with open(filename, "r") as f:
    for line in f:
        line = line.rstrip()
        if line[0] == '>':
            seq_count += 1
            name = seq_count
            seqs[name] = ''
        else:
            seqs[name] += line

# Store all the information, so that you can reuse it later
counter = {}

for key, seq in seqs.items():
    for pos in range(len(seq)-rep_len):
        repeat = seq[pos:pos + rep_len]
        if repeat in counter:
            counter[repeat] += 1
        else:
            counter[repeat] = 1

# Sort the counter to have max occurrences first
sorted_counter = sorted(counter.items(), key = lambda item:item[1], reverse=True  )

# Display the 5 max occurrences
for i in range(5):
    key, rep = sorted_counter[i]
    print("{} -> {}".format(key, rep))

# GCGCGC -> 11
# CCGCCG -> 11
# CGCCGA -> 10
# CGCGCG -> 9
# CGTCGA -> 9

Upvotes: 2

Related Questions