user147862

Regex to ensure group match doesn't end with a specific character

I'm having trouble coming up with a regular expression to match a particular case. I have a list of TV shows in about 4 formats, all of which appear in the examples below.

What I want to match is the show name. My main problem is that my regex matches the name of the show with a trailing '.'. My regex is the following:

"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"

Some Examples:

>>> import re

>>> SHOW_INFO = re.compile(r"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')

So the question is how do I avoid the first group ending with a period? I realize I could simply do:

var.strip(".")

However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?

Thanks in advance.

Upvotes: 1

Views: 2669

Answers (5)

ABach

Reputation: 3738

I believe this will do what you want:

^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$

I tested this against the following list of shows:

  • 30.Rock.S01E01
  • The.Office.0101
  • Lost.01x01
  • How.I.Met.Your.Mother.101

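If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. Note that the character class only lists lowercase letters, so the pattern has to be compiled case-insensitively (re.I in Python) to match titles like these. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.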
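
For anyone who wants to try it, here is a minimal sketch of that check in Python. The re.I flag is an assumption on my part (the class is a-z only), and the names TITLE_RE and show_title are made up for illustration:

```python
import re

# The pattern from this answer. re.I is an assumption: without it the
# lowercase-only class [0-9a-z\.] cannot match capitalised titles.
TITLE_RE = re.compile(
    r"^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$",
    re.I,
)

def show_title(filename):
    """Return the show title, or None if the name does not fit the pattern."""
    m = TITLE_RE.match(filename)
    return m.group(1) if m else None

for name in ("30.Rock.S01E01", "The.Office.0101",
             "Lost.01x01", "How.I.Met.Your.Mother.101"):
    print(name, "->", show_title(name))
```

Because the episode part sits in a non-capturing group, group 1 is the only capture, and it never ends with the separator dot.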

Upvotes: 0

SilentGhost

Reputation: 319551

I think this will do:

>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')

ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:

>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')

Upvotes: 1

Mark M

Reputation: 976

It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
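
A quick way to check this against the four examples from the question, as a sketch (the samples list and results dict are just names picked for the illustration):

```python
import re

# The question's pattern with a required "\." between the two groups.
SHOW_INFO = re.compile(
    r"^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
)

samples = [
    "Name.Of.Show.S01E01",
    "Name.Of.Show.0101",
    "Name.Of.Show.01x01",
    "Name.Of.Show.101",
]

# Map each sample to its (show name, episode) groups.
results = {s: SHOW_INFO.match(s).groups() for s in samples}
for s, groups in results.items():
    print(s, groups)
```

With the dot required, the first group can no longer swallow a leading digit of the episode number, so "Name.Of.Show.0101" should come out as ('Name.Of.Show', '0101').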

Upvotes: 0

Jan Willem B

Reputation: 3806

If the last part never contains a dot: ^(.*)\.([^\.]+)$
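
In Python that could look roughly like this (LAST_DOT and split_last_dot are illustrative names, not anything from the question):

```python
import re

# Greedy (.*) takes everything up to the LAST dot; the rest must be dot-free.
LAST_DOT = re.compile(r"^(.*)\.([^\.]+)$")

def split_last_dot(name):
    """Return (show, last_part), or None if the name contains no usable dot."""
    m = LAST_DOT.match(name)
    return m.groups() if m else None

print(split_last_dot("Name.Of.Show.0101"))  # ('Name.Of.Show', '0101')
print(split_last_dot("Name.Of.Show.S01E01"))  # ('Name.Of.Show', 'S01E01')
```

Keep in mind that this only splits on the last dot and does not check that the last part looks like an episode number, so a name such as "Name.Of.Show.mkv" would split just as happily.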

Upvotes: 0

Konrad Rudolph

Reputation: 545528

So the only real restriction on the last group is that it doesn’t contain a dot? Easy:

^(.*?)(\.[^.]+)$

The first group matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches one or more non-dot characters up to the end of the string, so the captured suffix includes its leading dot.

This works with all your test cases.
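
As a rough sketch in Python (SUFFIX_RE and split_title are hypothetical names), including one way to drop the dot that the second group captures:

```python
import re

# Lazy (.*?) grows only as far as needed; group 2 is the final ".xxx" part.
SUFFIX_RE = re.compile(r"^(.*?)(\.[^.]+)$")

def split_title(name):
    """Return (title, suffix_without_dot), or None if there is no dot suffix."""
    m = SUFFIX_RE.match(name)
    if m is None:
        return None
    # group(2) still carries the leading dot, so slice it off.
    return m.group(1), m.group(2)[1:]

print(SUFFIX_RE.match("Name.Of.Show.0101").groups())  # ('Name.Of.Show', '.0101')
print(split_title("Name.Of.Show.0101"))               # ('Name.Of.Show', '0101')
```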

Upvotes: 1
