Reputation: 25
When using the regex .search() I found that it matches only the first occurrence of a pattern in a string, and that .findall() is needed to find all occurrences of that pattern.
So, my question is: given two different strings that "talk" to each other, I need to find every occurrence of a specific pattern in the second string, grab the position of each match, and take the elements at those positions from the first string, then print them or save them in a new list.
To be clearer, I'll provide an example:
ACGCUGAGAGGACGAUGCGGACGUGCUUAGGACGUUCACACGGUGGAAGUUCACAACAAGCAGACGACUCGCUGAGGAUCCGAGAUUGCUCGCGAUCGG
...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).))))..
These are the two strings: the first with letters, the second with dots and brackets. The pattern I want to find, compiled as a regex, is "(\(\.+\))". Once the pattern is found in the second string, I want to grab the position of the match and return the corresponding elements of the first string. With this input I'd expect 2 different outputs: CACGG and GAUUGC.
To date, the code I have written looks like this:
for line in file:
    if (line[0] == "A") or (line[0] == "C") or (line[0] == "T") or (line[0] == "G"):
        apt.append(line)
        count = count + 1
    else:
        line = line.strip()
        pattern = "(\(\.+\))"
        match = re.search(pattern, line)
        if match:
            loop.append(apt[count][match.start():match.end()])
        else:
            continue
This obviously retrieves only the first match of the pattern that occurs in the second line of the file, giving only CACGG as output.
How can I modify the code so that it also retrieves the second occurrence of the pattern?
Thank you, any help is appreciated.
Upvotes: 1
Views: 2315
Reputation: 71538
If you don't mind using re.finditer:
>>> import re
>>> str1 = "ACGCUGAGAGGACGAUGCGGACGUGCUUAGGACGUUCACACGGUGGAAGUUCACAACAAGCAGACGACUCGCUGAGGAUCCGAGAUUGCUCGCGAUCGG"
>>> str2 = "...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).)))).."
>>> pat = re.compile(r"\([^()]+\)")
>>> for m in pat.finditer(str2):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group()))
...     print(str1[m.start():m.end()])
...
38-43: (...)
CACGG
83-89: (....)
GAUUGC
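The regex \([^()]+\) matches a parenthesized part that has no further parentheses inside it. [^()] is a negated character class: it matches any single character except a parenthesis.
Since your second string contains only dots and parentheses, you could also use the pattern \(\.+\), which allows only dots between the parentheses.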
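To make the difference between the two patterns concrete, here is a small sketch. The helper name `spans` and both test strings are made up for illustration. On pure dot-bracket input the patterns find the same spans; they only diverge when something other than a dot sits between the parentheses.

```python
import re

# Two candidate patterns for an innermost "( ... )" group.
no_parens = re.compile(r"\([^()]+\)")   # anything except parentheses inside
dots_only = re.compile(r"\(\.+\)")      # only literal dots inside


def spans(pattern, text):
    """Return the (start, end) span of every non-overlapping match."""
    return [m.span() for m in pattern.finditer(text)]


structure = "((...))..(....)"      # hypothetical dot-bracket string
print(spans(no_parens, structure))
print(spans(dots_only, structure))

mixed = "(.x.)(..)"                # hypothetical string with a stray letter
print(spans(no_parens, mixed))
print(spans(dots_only, mixed))
```

For the data in the question either pattern should do; the dots-only one is simply stricter about what may appear inside the loop.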
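In your case, it could be something like:
for line in file:
    if (line[0] == "A") or (line[0] == "C") or (line[0] == "T") or (line[0] == "G"):
        apt.append(line)
        count = count + 1
    else:
        line = line.strip()
        # re.compile returns a pattern object; a plain string has no finditer method
        pattern = re.compile(r"\(\.+\)")
        # finditer yields every non-overlapping match, not just the first
        for match in pattern.finditer(line):
            loop.append(apt[count][match.start():match.end()])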
It will be faster if you compile the pattern once, before reading the file.
I could not test this code, so treat it as a sketch. Keep in mind that each piece found will be appended to loop.
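Putting it together, a self-contained sketch might look like the following. It assumes the file alternates a sequence line with its dot-bracket line, and it uses an in-memory `io.StringIO` in place of the real file. The names `extract_loops`, `HAIRPIN`, `SEQ` and `STRUCT` are hypothetical. Instead of indexing through count, it simply remembers the most recent sequence line.

```python
import io
import re

# Compile once, outside the loop over lines.
HAIRPIN = re.compile(r"\(\.+\)")

# The two example strings from the question.
SEQ = "ACGCUGAGAGGACGAUGCGGACGUGCUUAGGACGUUCACACGGUGGAAGUUCACAACAAGCAGACGACUCGCUGAGGAUCCGAGAUUGCUCGCGAUCGG"
STRUCT = "...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).)))).."


def extract_loops(lines):
    """Collect the sequence pieces lying under each "(...)" match.

    Assumption: every dot-bracket line follows the sequence it describes.
    """
    loops = []
    sequence = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                  # skip blank lines
        if line[0] in "ACGUT":
            sequence = line           # remember the latest sequence line
        elif sequence is not None:
            # Every match, not just the first one.
            for m in HAIRPIN.finditer(line):
                loops.append(sequence[m.start():m.end()])
    return loops


# Stand-in for the real file object.
data = io.StringIO(SEQ + "\n" + STRUCT + "\n")
print(extract_loops(data))  # ['CACGG', 'GAUUGC']
```

Passing any iterable of lines works, so an open file handle can be handed to `extract_loops` directly.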
Upvotes: 3