How to count the longest run of consecutive occurrences of a substring in a string?

Question

I'm doing an exercise (cs50 - DNA) where I have to count specific consecutive substrings (STRs) that mimic DNA sequences. I'm finding myself overcomplicating my code, and I'm having a hard time figuring out how to proceed.

I have a list of substrings:

strs = ['AGATC', 'AATG', 'TATC']

And a string with a random sequence of letters:

AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

For each entry in strs, I want to count the longest run of consecutive (back-to-back) repeats of that substring.

So:

'AGATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'AATG' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'TATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

resulting in [4, 1, 5]

(Note that this isn't the best example, since there are no random repeating patterns scattered around, but I think it illustrates what I'm looking for.)

I know that I should be using something like re.match(rf"({strs}){2,}", string), because str.count(strs) will give me ALL occurrences, consecutive and non-consecutive.

My code so far:

#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string

# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(2)

# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
    data = list(csv.reader(database))
    STRS = data[0]
    data.pop(0)

# remove "name" so only thing remaining are STRs
STRS.pop(0)

# open file to compare against db
with open(sys.argv[2], "r") as seq:
    sequence = seq.read()

sequenceCount = []

# for each STR count the occurrences
# sequence.count(s) returns all occurrences, not just consecutive ones
for s in STRS:
    sequenceCount.append(sequence.count(s))

print(STRS)
print(sequenceCount)

"""
sequenceCount = {}

# for each STR count the occurrences
for s in STRS:
    sequenceCount[s] = sequence.count(s)

for line in data:
    print(line)
    for item in line[1:]:
        continue


# rf"({STRS}){2,}"
"""

Kota Mori · Accepted Answer

A regular expression for matching a repeated string looks like r"(AGATC)+".

For example,

import re

sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"
pattern = "AGATC"

r = re.search(r"({})+".format(pattern), sequence)

if r:
    print("start at", r.start())
    print("end at", r.end())

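If a match is found, you can get its start and end positions with the .start() and .end() methods. Dividing the length of the match (end minus start) by the length of the pattern gives the number of repetitions. Note that re.search only returns the first run it finds, which is not necessarily the longest one.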
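As a rough sketch of that calculation, here is the repetition count derived from the match span. The shortened sequence is made up for illustration, and re.escape and the non-capturing group (?:...) are my own additions rather than part of the snippet above:

```python
import re

# Shortened, made-up sequence: "GG", then AGATC four times, then "TATC".
sequence = "GGAGATCAGATCAGATCAGATCTATC"
pattern = "AGATC"

# (?:...)+ matches one or more back-to-back copies of the pattern.
# re.escape is a precaution in case the pattern contains regex metacharacters.
match = re.search(r"(?:{})+".format(re.escape(pattern)), sequence)

# Length of the matched run divided by the pattern length = number of repeats.
repeats = (match.end() - match.start()) // len(pattern) if match else 0
print(repeats)  # 4
```

This only measures the first run in the string, so on its own it is not enough when several runs of different lengths exist.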

If you need to find all runs in the sequence, you can use re.finditer, which yields one match object per run.

You can then loop over the target patterns and, for each one, keep the longest run.
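Putting it together, one possible sketch looks like this. The helper name longest_run is hypothetical, and wrapping the pattern in a lookahead is my own assumption: it is there so that runs starting inside an earlier match are not skipped when a pattern can overlap with itself. For the STRs in the question a plain re.finditer(r"(?:AGATC)+", ...) would give the same result.

```python
import re

def longest_run(sequence, pattern):
    """Return the largest number of back-to-back repeats of pattern in sequence."""
    if not pattern:
        return 0
    # The lookahead (?=...) is zero-width, so finditer tries every start
    # position; group 1 captures the greedy run beginning at that position.
    regex = re.compile(r"(?=((?:{})+))".format(re.escape(pattern)))
    return max(
        (len(m.group(1)) // len(pattern) for m in regex.finditer(sequence)),
        default=0,  # no match at all means zero repeats
    )

strs = ["AGATC", "AATG", "TATC"]
sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"

counts = [longest_run(sequence, s) for s in strs]
print(counts)  # [4, 1, 5]
```

In the question's script, the same helper could replace sequence.count(s) inside the loop over STRS.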
