Reputation: 502
I'm doing an exercise (cs50 - DNA) where I have to count specific consecutive substrings (STRS) mimicking DNA sequences, I'm finding myself overcomplicating my code and I'm having a hard time figuring out how to proceed.
I have a list of substrings:
strs = ['AGATC', 'AATG', 'TATC']
And a String with a random sequence of letters:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
I want to count the biggest consecutive substrings that match each strs.
So:
'AGATC'
- AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'AATG'
- AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
'TATC'
- AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
resulting in [4, 1, 5]
(Note that this isn't the best example since there are no random repeating patterns scatered around but I think it illustrates what I'm looking for)
I know that I should be something of the likes of re.match(rf"({strs}){2,}", string)
because str.count(strs)
will give me ALL consecutive and non consecutive items.
My code so far:
#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string
# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
print("Usage: python dna.py data.csv sequence.txt")
sys.exit(2)
# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
data = list(csv.reader(database))
STRS = data[0]
data.pop(0)
# remove "name" so only thing remaining are STRs
STRS.pop(0)
# open file to compare agaist db
with open(sys.argv[2], "r") as seq:
sequence = seq.read()
sequenceCount = []
# for each STR count the occurences
# sequence.count(s) returns all
for s in STRS:
sequenceCount.append(sequence.count(s))
print(STRS)
print(sequenceCount)
"""
sequenceCount = {}
# for each STR count the occurences
for s in STRS:
sequenceCount[s] = sequence.count(s)
for line in data:
print(line)
for item in line[1:]:
continue
# rf"({STRS}){2,}"
"""
Upvotes: 2
Views: 587
Reputation: 1404
Here using two for loops; one to grab each string (sequence) from strs
, and the other to iterate over our dna
strand to match each string from strs
against it, and a while loop is used if a match was found to keep looking for consecutive (back2back) matches. (Added inline comments to give brief explanations on each step)
dna = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATAGATCTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
strs = ['AGATC', 'AATG', 'TATC']
def seq_finder(sequence, dna):
start = 0 # Will allow us to skip scanned sequences
counter = [0] * len(sequence) # Create a list of zeros to store sequence occurrences
for idx, seq in enumerate(sequence): # Iterate over every entry in our sequence "strs"
k = len(seq)
holder = 0 # A temporarily holder that will store #occurrences of *consecutive* sequences
for i in range(start, len(dna)): # For each sequence, iterate over our "dna" strand
if dna[i:i+k] == strs[idx]: # If match is found:
holder += 1 # Increment our holder by 1
while dna[i:i+k] == dna[i+k:i+k*2]: # If our match has an identical match ahead (consecutively):
holder += 1 # Increment our holder by 1
i += k # Start the next list indexing from our new match
start = i + 1 # To skip repetitive iterations over same matches
if holder > counter[idx]:
counter[idx] = holder # Only replace counter if new holder > old holder
holder = 0 # Reset the holder when we existed our of our while loop (finished finding consecutives)
return counter
Upvotes: 1
Reputation: 6740
Regular expression for finding repeating strings is like r"(AGATC)+"
.
For example,
import re
sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"
pattern = "AGATC"
r = re.search(r"({})+".format(pattern), sequence)
if r:
print("start at", r.start())
print("end at", r.end())
If a match is found, then you can access the starting and ending position by .start
and .end
methods. You can calculate the repetition using them.
If you need to find all matches in the sequence, then you can use re.finditer
, which gives you match objects iteratively.
You can loop over target patterns and find the longest one.
Upvotes: 1