MiguelP
MiguelP

Reputation: 502

how to count the biggest consecutive occurences of substring in string?

I'm doing an exercise (cs50 - DNA) where I have to count specific consecutive substrings (STRS) mimicking DNA sequences, I'm finding myself overcomplicating my code and I'm having a hard time figuring out how to proceed.

I have a list of substrings:

strs = ['AGATC', 'AATG', 'TATC']

And a String with a random sequence of letters:

AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

I want to count the biggest consecutive substrings that match each strs.

So:

resulting in [4, 1, 5]

(Note that this isn't the best example since there are no random repeating patterns scatered around but I think it illustrates what I'm looking for)

I know that I should be something of the likes of re.match(rf"({strs}){2,}", string) because str.count(strs) will give me ALL consecutive and non consecutive items.

My code so far:

#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string

# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(2)

# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
    data = list(csv.reader(database))
    STRS = data[0]
    data.pop(0)

# remove "name" so only thing remaining are STRs
STRS.pop(0)

# open file to compare agaist db
with open(sys.argv[2], "r") as seq:
    sequence = seq.read()

sequenceCount = []

# for each STR count the occurences
# sequence.count(s) returns all
for s in STRS:
    sequenceCount.append(sequence.count(s))

print(STRS)
print(sequenceCount)

"""
sequenceCount = {}

# for each STR count the occurences
for s in STRS:
    sequenceCount[s] = sequence.count(s)

for line in data:
    print(line)
    for item in line[1:]:
        continue


# rf"({STRS}){2,}"
"""

Upvotes: 2

Views: 587

Answers (2)

Omar AlSuwaidi
Omar AlSuwaidi

Reputation: 1404

Here using two for loops; one to grab each string (sequence) from strs, and the other to iterate over our dna strand to match each string from strs against it, and a while loop is used if a match was found to keep looking for consecutive (back2back) matches. (Added inline comments to give brief explanations on each step)

dna = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATAGATCTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
strs = ['AGATC', 'AATG', 'TATC']


def seq_finder(sequence, dna):
    start = 0  # Will allow us to skip scanned sequences
    counter = [0] * len(sequence)  # Create a list of zeros to store sequence occurrences
    for idx, seq in enumerate(sequence):  # Iterate over every entry in our sequence "strs"
        k = len(seq)
        holder = 0  # A temporarily holder that will store #occurrences of *consecutive* sequences
        for i in range(start, len(dna)):  # For each sequence, iterate over our "dna" strand
            if dna[i:i+k] == strs[idx]:  # If match is found:
                holder += 1  # Increment our holder by 1
                while dna[i:i+k] == dna[i+k:i+k*2]:  # If our match has an identical match ahead (consecutively):
                    holder += 1  # Increment our holder by 1
                    i += k  # Start the next list indexing from our new match
                    start = i + 1  # To skip repetitive iterations over same matches
                if holder > counter[idx]:
                    counter[idx] = holder  # Only replace counter if new holder > old holder
                holder = 0  # Reset the holder when we existed our of our while loop (finished finding consecutives)
    return counter

Upvotes: 1

Kota Mori
Kota Mori

Reputation: 6740

Regular expression for finding repeating strings is like r"(AGATC)+".

For example,

import re

sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"
pattern = "AGATC"

r = re.search(r"({})+".format(pattern), sequence)

if r:
    print("start at", r.start())
    print("end at", r.end())

If a match is found, then you can access the starting and ending position by .start and .end methods. You can calculate the repetition using them.

If you need to find all matches in the sequence, then you can use re.finditer, which gives you match objects iteratively.

You can loop over target patterns and find the longest one.

Upvotes: 1

Related Questions