Reputation: 5108
Sorry for the bad question title, I couldn't figure out a better one.
I need a regex that extracts the season, episode and title from TV show transcripts. In my file they can appear like this:
<span class="topic">01x02 - The Big Bran Hypothesis</span><b
<td><b>01x07 - The Dumpling Paradox</b></td>
<title>Transcripts - Forever Dreaming :: 01x07 - The Dumpling Paradox - The Big Bang Theory</title>
<title>Transcripts - Forever Dreaming :: 06x04 - The Re-Entry Minimisation - The Big Bang Theory</title>
I tried with:
([\d]+x[\d]+)\s?[-]?\s?([\w\s]*)
This regex matches:
01x02 - The Big Bran Hypothesis
01x07 - The Dumpling Paradox
01x07 - The Dumpling Paradox
06x04 - The Re
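The truncation on the last sample can be reproduced with a short Python sketch. Python's `re` module is an assumption here (the question does not name a regex flavour), and the names `ATTEMPT` and `line` are only illustrative:

```python
import re

# The attempted pattern from the question, unchanged.
ATTEMPT = re.compile(r'([\d]+x[\d]+)\s?[-]?\s?([\w\s]*)')

# Last sample line from the question.
line = ('<title>Transcripts - Forever Dreaming :: 06x04 - '
        'The Re-Entry Minimisation - The Big Bang Theory</title>')

m = ATTEMPT.search(line)
print(m.group(1))  # season and episode: 06x04
print(m.group(2))  # title stops at the first hyphen: The Re
```

The second group ends at "The Re" because its character class `[\w\s]` does not include the hyphen.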
The issue I'm facing is how to get the rest of the title of the last one ("The Re-Entry Minimisation") without " - The Big Bang Theory".
I tried adding a - to the second capturing group, but this includes the part after the title too.
I also tried adding a positive lookahead for -, but this can't work either, as it also matches the first - after the season and episode.
I guess it is quite straightforward to do this, but I can't figure it out. Does anyone have an idea? Thank you!
Upvotes: 1
Views: 144
Reputation: 5261
This regex will successfully match a hyphenated title, while avoiding the trailing show name:
(\d+)x(\d+) ?- ?([-\w\s]+) -
It will produce three capture groups: the season, the episode and the title.
Breakdown:
(\d+)x(\d+)
matches and captures the season and episode, each in its own group.
 ?- ?
matches the dash delimiter, with or without spaces.
([-\w\s]+) -
captures any word characters, dashes, and whitespace, but only up to a dash with spaces around it, which seems to be the only distinction between a dash within the title and one after it.
See regex101 demo.
Note: if you really need the entire match to exclude the show name, rather than using the specific groups, just change the trailing " -" to a positive lookahead, (?= - ), so it won't match the trailing dash.
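As a rough check, here is a minimal Python sketch that runs the pattern above, and the lookahead form from the note, against the sample lines from the question. Python's `re` module is an assumption, since the thread does not name a regex flavour. `VARIANT` is a hypothetical extension that is not part of this answer: it makes the title group lazy and also accepts a following `<`, because the samples without a trailing show name contain no " -" after the title.

```python
import re

# Sample lines copied from the question.
SAMPLES = [
    '<span class="topic">01x02 - The Big Bran Hypothesis</span><b',
    '<td><b>01x07 - The Dumpling Paradox</b></td>',
    '<title>Transcripts - Forever Dreaming :: 01x07 - The Dumpling Paradox'
    ' - The Big Bang Theory</title>',
    '<title>Transcripts - Forever Dreaming :: 06x04 - The Re-Entry Minimisation'
    ' - The Big Bang Theory</title>',
]

# Pattern from the answer, unchanged: the title must be followed by " -".
ANSWER = re.compile(r'(\d+)x(\d+) ?- ?([-\w\s]+) -')

# Lookahead form from the note: the full match excludes the trailing dash.
LOOKAHEAD = re.compile(r'(\d+)x(\d+) ?- ?([-\w\s]+)(?= - )')

# Hypothetical variant (an assumption, not from the answer): lazy title group
# that stops before " - " or before the next tag's "<".
VARIANT = re.compile(r'(\d+)x(\d+) ?- ?([-\w\s]+?)(?= - |<)')


def extract(pattern, line):
    """Return (season, episode, title) or None if the pattern does not match."""
    m = pattern.search(line)
    return m.groups() if m else None


for sample in SAMPLES:
    print(extract(ANSWER, sample), extract(VARIANT, sample))
```

On these samples the unchanged pattern returns `None` for the first two lines, where nothing follows the title except a tag, and `('06', '04', 'The Re-Entry Minimisation')` for the last one. The variant returns a title for all four.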
Upvotes: 1
Reputation: 85
This should work:
(\d{2}x\d{2} - [\w\s]*(-\w)?[\w\s]*)
It also returns a second group, but you can simply ignore it. Or you can just use the full match with:
\d{2}x\d{2} - [\w\s]*(-\w)?[\w\s]*
----- EDIT -----
To be precise, the trick is to allow hyphens inside words while ignoring hyphens that act as separators.
The following regex is more general and matches something like "out-of-the-box":
\d{2}x\d{2} - ([\w\s]*(-\w)?)*
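A small Python sketch of the more general pattern, assuming Python's `re` module (the thread does not name a regex flavour). The helper `episode_text` and the "out-of-the-box" sample line are made up for illustration. Because `[\w\s]*` also consumes the space before the separator dash, the sketch strips trailing whitespace from the full match:

```python
import re

# The more general pattern from the edit above, unchanged.
GENERAL = re.compile(r'\d{2}x\d{2} - ([\w\s]*(-\w)?)*')


def episode_text(line):
    """Return 'SSxEE - Title' without trailing whitespace, or None."""
    m = GENERAL.search(line)
    return m.group(0).rstrip() if m else None


# Last sample line from the question.
print(episode_text(
    '<title>Transcripts - Forever Dreaming :: 06x04 - '
    'The Re-Entry Minimisation - The Big Bang Theory</title>'
))

# Hypothetical line with a multiply hyphenated word.
print(episode_text('01x01 - Thinking out-of-the-box - Some Show'))
```

Using the full match avoids relying on the inner groups, which only hold the last repetition.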
Upvotes: 0