Reputation: 169563
Is there a way to match a pattern (e\d\d) several times, capturing each one into a group? For example, given the string..
blah.s01e24e25
..I wish to get four groups:
1 -> blah
2 -> 01
3 -> 24
4 -> 25
The obvious regex to use is (in Python regex):
import re
re.match("(\w+).s(\d+)e(\d+)e(\d+)", "blah.s01e24e25").groups()
..but I also want to match either of the following:
blah.s01e24
blah.s01e24e25e26
You can't seem to do (e\d\d)+, or rather you can, but it only captures the last occurrence:
>>> re.match("(\w+).s(\d+)(e\d\d){2}", "blah.s01e24e25e26").groups()
('blah', '01', 'e25')
>>> re.match("(\w+).s(\d+)(e\d\d){3}", "blah.s01e24e25e26").groups()
('blah', '01', 'e26')
I want to do this in a single regex because I have multiple patterns to match TV episode filenames, and do not want to duplicate each expression to handle multiple episodes:
\w+\.s(\d+)e(\d+) # matches blah.s01e01
\w+\.s(\d+)e(\d+)e(\d+) # matches blah.s01e01e02
\w+\.s(\d+)e(\d+)e(\d+)e(\d+) # matches blah.s01e01e02e03
\w+ - \d+x\d+ # matches blah - 01x01
\w+ - \d+x\d+x\d+ # matches blah - 01x01x02
\w+ - \d+x\d+x\d+x\d+ # matches blah - 01x01x02x03
..and so on for numerous other patterns.
Another thing to complicate matters: I wish to store these regexes in a config file, so a solution using multiple regexes and function calls is not desired. If this proves impossible, I'll just allow the user to add simple regexes.
Basically, is there a way to capture a repeating pattern using regex?
Upvotes: 4
Views: 2547
Reputation: 95921
Perhaps something like this?
import re
def episode_matcher(filename):
    # Capture the whole run of eNN tokens as one group, then split it afterwards.
    m1 = re.match(r"(?i)(.*?)\.s(\d+)((?:e\d+)+)", filename)
    if m1:
        m2 = re.findall(r"\d+", m1.group(3))
        return m1.group(1), m1.group(2), m2
    # auto return None here
>>> episode_matcher("blah.s01e02")
('blah', '01', ['02'])
>>> episode_matcher("blah.S01e02E03")
('blah', '01', ['02', '03'])
Upvotes: 0
Reputation: 169563
After thinking about the problem, I think I have a simpler solution, using named groups.
The simplest regex a user (or I) could use is:
(\w+)\.s(\d+)e(\d+)
The filename parsing class will take the first group as the show name, second as season number, third as episode number. This covers a majority of files.
I'll allow a few different named groups for these:
(?P<showname>\w+)\.s(?P<seasonnumber>\d+)e(?P<episodenumber>\d+)
To support multiple episodes, I'll support two named groups, something like startingepisodenumber and endingepisodenumber, to support things like showname.s01e01-03:
(?P<showname>\w+)\.s(?P<seasonnumber>\d+)e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>\d+)
And finally, allow named groups with names matching episodenumber\d+ (episodenumber1, episodenumber2 etc):
(?P<showname>\w+)\.
s(?P<seasonnumber>\d+)
e(?P<episodenumber1>\d+)
e(?P<episodenumber2>\d+)
e(?P<episodenumber3>\d+)
It still requires possibly duplicating the patterns for different numbers of e01s, but there will never be a file with two non-consecutive episodes (like show.s01e01e03e04), so using the starting/endingepisodenumber groups should solve this, and for weird cases users come across, they can use the episodenumber\d+ group names.
This doesn't really answer the sequence-of-patterns question, but it solves the problem that led me to ask it! (I'll still accept another answer that shows how to match s01e23e24...e27 in one regex, if someone works this out!)
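For what it's worth, here is a rough sketch of how the parsing side of this scheme might look. The helper name parse_episodes and the two example patterns are my own assumptions for illustration (the patterns assume filenames like show.s01e01e02, with no dot before the e), so treat it as one possible reading of the idea rather than the actual parser:

```python
import re

# Hypothetical helper (not from the post above): reads the named groups
# described in this answer and returns (showname, season, [episodes]).
def parse_episodes(pattern, filename):
    match = re.match(pattern, filename, re.IGNORECASE | re.VERBOSE)
    if match is None:
        return None
    groups = match.groupdict()
    if groups.get("startingepisodenumber") is not None:
        # Range form (show.s01e01-03): expand start..end inclusive.
        start = int(groups["startingepisodenumber"])
        end = int(groups["endingepisodenumber"])
        episodes = list(range(start, end + 1))
    else:
        # Collect episodenumber, episodenumber1, episodenumber2, ... in
        # numeric order, skipping optional groups that did not take part.
        numbered = [
            (name, value) for name, value in groups.items()
            if re.fullmatch(r"episodenumber\d*", name) and value is not None
        ]
        numbered.sort(key=lambda item: int(item[0][len("episodenumber"):] or 0))
        episodes = [int(value) for _, value in numbered]
    return groups["showname"], int(groups["seasonnumber"]), episodes

# Example patterns, written as a user might store them in a config file.
RANGE_PATTERN = (r"(?P<showname>\w+)\.s(?P<seasonnumber>\d+)"
                 r"e(?P<startingepisodenumber>\d+)-(?P<endingepisodenumber>\d+)")
MULTI_PATTERN = r"""(?P<showname>\w+)\.
    s(?P<seasonnumber>\d+)
    e(?P<episodenumber1>\d+)
    (?:e(?P<episodenumber2>\d+))?
    (?:e(?P<episodenumber3>\d+))?"""

print(parse_episodes(RANGE_PATTERN, "show.s01e01-03"))
print(parse_episodes(MULTI_PATTERN, "show.s01e01e02"))
```

Making the second and third episode groups optional in MULTI_PATTERN is a choice I made for the sketch, so that one stored pattern covers one to three episodes. Anything past the third episode would still need another group.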
Upvotes: 0
Reputation: 7343
Use non-capturing parentheses, (?:asdfasdg), and make them optional so they do not have to appear: (?:adsfasdf)?
c = re.compile(r"""(\w+).s(\d+)
    (?:
        e(\d+)
        (?:
            e(\d+)
        )?
    )?
    """, re.X)
or
c = re.compile(r"""(\w+).s(\d+)(?:e(\d+)(?:e(\d+))?)?""", re.X)
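To show what the optional nested groups give back, here is a small usage sketch. The wrapper season_and_episodes is a name I made up for the example, and I escaped the dot before the s, which the pattern above leaves unescaped:

```python
import re

# Same nested-optional idea as above, with the dot escaped (my assumption).
c = re.compile(r"(\w+)\.s(\d+)(?:e(\d+)(?:e(\d+))?)?")

# Hypothetical wrapper: drops the groups that did not participate (None).
def season_and_episodes(filename):
    m = c.match(filename)
    if m is None:
        return None
    show, season, *episodes = m.groups()
    return show, season, [e for e in episodes if e is not None]

print(season_and_episodes("blah.s01e24"))        # one episode group filled
print(season_and_episodes("blah.s01e24e25"))     # both episode groups filled
print(season_and_episodes("blah.s01e24e25e26"))  # third episode is not captured
```

Note the limit: the pattern only nests two levels deep, so with blah.s01e24e25e26 the match stops after e25 and the e26 is silently left out. You need one more nested (?:e(\d+))? for every extra episode you want to allow.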
Upvotes: 1
Reputation: 8953
The number of captured groups is equal to the number of parenthesized groups in the pattern, no matter how many times a group repeats. Look at findall or finditer for solving your problem.
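A minimal sketch of the finditer route, assuming the episode markers always look like e followed by digits (the function name episode_numbers is just for illustration):

```python
import re

# Hypothetical helper: every eNN occurrence becomes its own match object,
# so no occurrence overwrites another the way a repeated group does.
def episode_numbers(filename):
    return [m.group(1) for m in re.finditer(r"e(\d+)", filename, re.IGNORECASE)]

print(episode_numbers("blah.s01e24e25e26"))
```

re.findall(r"e(\d+)", filename) would give the same list directly. Be aware that this simple pattern also picks up an e followed by digits inside the show name, so in practice you would probably want to narrow the search to the part after the season number first.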
Upvotes: 1
Reputation: 281455
Do it in two steps, one to find all the numbers, then one to split them:
import re
def get_pieces(s):
    # Error checking omitted!
    whole_match = re.search(r'\w+\.(s\d+(?:e\d+)+)', s)
    return re.findall(r'\d+', whole_match.group(1))
print(get_pieces(r"blah.s01e01"))
print(get_pieces(r"blah.s01e01e02"))
print(get_pieces(r"blah.s01e01e02e03"))
# prints:
# ['01', '01']
# ['01', '01', '02']
# ['01', '01', '02', '03']
Upvotes: 5