Igl3

Reputation: 5108

Regex to get string with only one dash

Sorry for the bad question title, I couldn't figure out a better one.

I need a regex that extracts the season, episode, and title from TV show transcripts. In my file they can appear like this:

<span class="topic">01x02 - The Big Bran Hypothesis</span><b
<td><b>01x07 - The Dumpling Paradox</b></td>
<title>Transcripts - Forever Dreaming :: 01x07 - The Dumpling Paradox - The Big Bang Theory</title>
<title>Transcripts - Forever Dreaming :: 06x04 - The Re-Entry Minimisation - The Big Bang Theory</title>

I tried with:

([\d]+x[\d]+)\s?[-]?\s?([\w\s]*)

See Regex101 Example here

This regex matches:

01x02 - The Big Bran Hypothesis
01x07 - The Dumpling Paradox
01x07 - The Dumpling Paradox
06x04 - The Re

The issue I'm facing is how to get the rest of the last title ("The Re-Entry Minimisation") without " - The Big Bang Theory".

I tried adding a - to the character class in the second capturing group, but then the match also includes the part after the title.
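For reference, the behaviour described so far can be reproduced with a short Python sketch. Using Python's re module is an assumption (the question does not name a language), and the second pattern is only one plausible reading of "adding a - in the second capturing group". The helper name titles is made up for illustration.

```python
import re

# The four sample lines from the question.
samples = [
    '<span class="topic">01x02 - The Big Bran Hypothesis</span><b',
    '<td><b>01x07 - The Dumpling Paradox</b></td>',
    '<title>Transcripts - Forever Dreaming :: 01x07 - The Dumpling Paradox - The Big Bang Theory</title>',
    '<title>Transcripts - Forever Dreaming :: 06x04 - The Re-Entry Minimisation - The Big Bang Theory</title>',
]

# Original attempt, copied verbatim from the question.
original = re.compile(r"([\d]+x[\d]+)\s?[-]?\s?([\w\s]*)")
# Assumed variant: the dash is added to the second group's character class.
with_dash = re.compile(r"([\d]+x[\d]+)\s?[-]?\s?([-\w\s]*)")


def titles(pattern, lines):
    """Return the second capture group of the first match on each line.

    strip() removes the trailing space that [\\w\\s]* picks up before a dash.
    """
    return [pattern.search(line).group(2).strip() for line in lines]


original_titles = titles(original, samples)
with_dash_titles = titles(with_dash, samples)

print(original_titles[-1])   # title cut off at the hyphen
print(with_dash_titles[-1])  # title plus the trailing show name
```

The first pattern stops at the hyphen in "Re-Entry", while the variant runs on through " - The Big Bang Theory", which is the trade-off described above.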

I also tried adding a positive lookahead for -, but that can't work either, because it also matches the first - after the season and episode.

I guess this is quite straightforward, but I can't figure it out. Does anyone have an idea? Thank you!

Upvotes: 1

Views: 144

Answers (2)

Brian Stephens

Reputation: 5261

This regex will successfully match a hyphenated title, while avoiding the trailing show name: (\d+)x(\d+) ?- ?([-\w\s]+) -

It will produce the following capture groups:

  1. Season
  2. Episode
  3. Title

Breakdown:

  • (\d+)x(\d+) matches and captures the season and episode, each in its own group
  •  ?- ? (optional space, dash, optional space) matches the dash delimiter, with or without spaces
  • ([-\w\s]+) - captures any word characters, dashes, and whitespace, but only up to a dash that is preceded by a space, which seems to be the only distinction between a dash within the title and the one after it.

See regex101 demo.

Note: if you really need the entire match to exclude the show name, rather than using the specific groups, just change the trailing " -" to a positive lookahead (?= - ) so the match won't include the trailing dash.
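A minimal sketch of how the capture groups and the lookahead variant behave, assuming Python's re module (the answer itself is language-agnostic). The variable names are illustrative only.

```python
import re

# Pattern from the answer: season, episode, title, then a trailing " -".
pattern = re.compile(r"(\d+)x(\d+) ?- ?([-\w\s]+) -")
# Variant from the note: the trailing dash becomes a positive lookahead.
lookahead = re.compile(r"(\d+)x(\d+) ?- ?([-\w\s]+)(?= - )")

line = (
    "<title>Transcripts - Forever Dreaming :: "
    "06x04 - The Re-Entry Minimisation - The Big Bang Theory</title>"
)

# Groups 1-3 hold season, episode, and title.
season, episode, title = pattern.search(line).groups()

# With the lookahead, the full match itself stops before " - ".
full = lookahead.search(line).group(0)

# Both patterns need a " -" after the title, so a sample line without the
# show name is not matched by this sketch.
no_show_name = pattern.search("<td><b>01x07 - The Dumpling Paradox</b></td>")

print(season, episode, title)
print(full)
print(no_show_name)
```

Because the pattern requires a dash after the title, the sample lines that lack the trailing show name produce no match here; if those lines matter, the trailing part would have to be made optional or handled separately.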

Upvotes: 1

Antonio Ken Iannillo

Reputation: 85

This should work:

(\d{2}x\d{2} - [\w\s]*(-\w)?[\w\s]*)

It also returns a second group, which you can simply ignore. Alternatively, you can use the full match with

\d{2}x\d{2} - [\w\s]*(-\w)?[\w\s]*

----- EDIT -----

To be precise, the trick is to allow hyphens inside hyphenated words while ignoring standalone hyphens.

The following regex is more general and also matches titles containing something like "out-of-the-box":

\d{2}x\d{2} - ([\w\s]*(-\w)?)*
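The difference between the single-hyphen pattern and the more general one can be sketched in Python (an assumed language choice). The string made_up is an invented title used purely for illustration, and full_match is a hypothetical helper.

```python
import re

# Allows at most one hyphenated word in the title.
single = re.compile(r"\d{2}x\d{2} - [\w\s]*(-\w)?[\w\s]*")
# Allows any number of hyphens, as long as each one is followed by a word character.
general = re.compile(r"\d{2}x\d{2} - ([\w\s]*(-\w)?)*")

real = (
    "<title>Transcripts - Forever Dreaming :: "
    "06x04 - The Re-Entry Minimisation - The Big Bang Theory</title>"
)
# Invented example, not taken from the transcripts.
made_up = "02x03 - An out-of-the-box Idea - Some Show"


def full_match(pattern, text):
    """Return the full match, minus the trailing space that [\\w\\s]* picks up."""
    return pattern.search(text).group(0).strip()


print(full_match(single, real))
print(full_match(general, real))
print(full_match(single, made_up))   # stops at the second hyphen
print(full_match(general, made_up))  # keeps the whole hyphenated phrase
```

Both patterns stop before " - " because a dash followed by a space does not satisfy -\w. The full match keeps a trailing space, hence the strip() in the helper.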

Upvotes: 0
