ArchonOSX

Reputation: 121

Regex to find and remove lyrics

I have a workflow that separates the subtitles from a video and then I strip the hearing impaired and other extraneous material from the file. I just want the speaking parts.

The lyrics to songs are sometimes interspersed with speaking parts of the subtitle and this is quite irritating to me so I strip them out. Unfortunately, there are multiple patterns that the creators of the subtitles use to show lyrics. Almost all of them use some form of musical notes with text between consisting of 1, 2, or maybe 3 lines of text.

I have been using BBEdit and the following regex to find the ones formatted for normal text:

♪ .+\r.+\r.+ ♪|♪ .+\r.+ ♪|♪ .+ ♪|♪ .+\r

This finds the 3-line versions, then the 2-line versions, then the single-line ones, and finally any line that starts with a note and runs to the line break.

I am looking to simplify this to match anything that begins and ends with a musical note, with any amount of text and any number of lines in between, stopping at a blank line (a return without any text).

An example:

906
00:42:10,597 --> 00:42:12,530
♪ When I woke up
I opened my eyes♪

907

Can someone please show me how to do this without all the OR operators?

Maybe something like this? ♪.+\r{1,3}♪ (but I know this does not work)

Bonus points for a regex that finds the lyrics inside italics html blocks keeping in mind each html block would need to be selected in its entirety.

I have been using this one (without the explanatory info):

Find all lyrics within html blocks 3, 2, & 1 line
<i>♪ .+\r.+\r.+ ♪</i>|<i>♪ .+\r.+ ♪</i>|<i>♪ .+ ♪</i>

Thanks for reading this and Happy Day!

Upvotes: 1

Views: 82

Answers (1)

The fourth bird

Reputation: 163632

The pattern ♪.+\r{1,3}♪ will repeat a carriage return 1 to 3 times instead of 1 to 3 whole lines.
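To make that concrete, here is a small Python sketch rather than BBEdit itself. One assumption to note: the sample text uses \n for line breaks and the \r in the pattern is swapped for \n, because Python's . stops only at \n.

```python
import re

# Sample cue from the question, with "\n" line breaks (an assumption for Python).
cue = "906\n00:42:10,597 --> 00:42:12,530\n♪ When I woke up\nI opened my eyes♪\n\n907\n"

# The attempted pattern, with \r swapped for \n.
attempt = re.compile(r"♪.+\n{1,3}♪")

# The quantifier {1,3} applies to the line break only, so the second line
# of text ("I opened my eyes") can never be consumed.
print(attempt.search(cue))  # None

# It only matches when 1-3 line breaks sit directly before the closing note.
print(attempt.search("♪ la la\n\n♪") is not None)  # True
```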

According to the BBEdit regex reference, you could for example make the pattern very specific: match the line that starts with a music note, followed by 0-2 more lines, ending on a music note, and optionally match the surrounding italic tags with a conditional.

Note that matching tags like <i>...</i> is not foolproof, as a regex has no notion of html/xml structure.

(<i>\s*)?♪(?!\s*♪)[^\r♪]*(?:\r[^\r♪]*){0,2}♪(?(1)\s*<\/i>)

The pattern matches:

  • (<i>\s*)? Optional capture group 1, match <i> and optional whitespace chars
  • ♪ Match the opening music note
  • (?!\s*♪) Negative lookahead, assert that what follows is not just optional whitespace chars and then another ♪
  • [^\r♪]* Match the rest of the line without carriage returns or a music note
  • (?:\r[^\r♪]*){0,2} Match 0-2 lines without carriage returns or music notes (so the total is 1 - 3 lines)
  • ♪ Match the closing music note
  • (?(1)\s*<\/i>) Conditional, if we have group 1, then match optional whitespace chars and </i>
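The breakdown above can be tried out with a short Python sketch. This is not BBEdit; it assumes \n line breaks (so each \r becomes \n), and the names LYRICS and sample are made up for the illustration. Python's re module supports the same (?(1)...) conditional.

```python
import re

# Assumption: line breaks are "\n" here, so every \r from the BBEdit pattern becomes \n.
LYRICS = re.compile(
    r"(<i>\s*)?"            # optional group 1: opening italic tag
    r"♪(?!\s*♪)"            # opening note, not directly followed by another note
    r"[^\n♪]*"              # rest of the first line
    r"(?:\n[^\n♪]*){0,2}"   # up to two more lines
    r"♪"                    # closing note
    r"(?(1)\s*</i>)"        # closing tag only if group 1 matched
)

# Hypothetical subtitle fragment: one 2-line lyric, one spoken line, one italic lyric.
sample = (
    "1\n♪ When I woke up\nI opened my eyes♪\n\n"
    "2\nHello there.\n\n"
    "3\n<i>♪ One line ♪</i>\n"
)

found = [m.group(0) for m in LYRICS.finditer(sample)]
print(found)  # the two lyric blocks; the spoken line is left alone
```

The italic block is returned with its tags because group 1 matched, so the conditional also requires the closing </i>.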

As \s* can also match newlines and you don't want that as a line count, you could omit it if the music note is directly after the italic tag, or use [^\S\r]* to match whitespace characters without carriage returns.
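A quick way to see the difference, sketched in Python under the assumption of \n line breaks (so [^\S\r]* becomes [^\S\n]*). The input string is a made-up edge case with blank lines between the tag and the note; only the \s* next to the tags is changed.

```python
import re

# \s* next to the tags: may run across line breaks.
loose = re.compile(r"(<i>\s*)?♪(?!\s*♪)[^\n♪]*(?:\n[^\n♪]*){0,2}♪(?(1)\s*</i>)")

# [^\S\n]* next to the tags: whitespace, but never a line break.
strict = re.compile(r"(<i>[^\S\n]*)?♪(?!\s*♪)[^\n♪]*(?:\n[^\n♪]*){0,2}♪(?(1)[^\S\n]*</i>)")

# Hypothetical input: two line breaks between the tag and the note.
text = "<i>\n\n♪ la ♪</i>"

print(repr(loose.search(text).group(0)))   # whole block, extra lines included
print(repr(strict.search(text).group(0)))  # only the note-to-note part
```

With the stricter class the opening tag is not picked up across the line breaks, so group 1 stays unset and the closing tag is not required either.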

See a regex demo or another demo when you don't want to allow angle brackets in the lyrics.


Some other options, matching with the italics only:

<i>♪(?!\s*♪).*(?:\r.*?){0,2}♪<\/i>

Without the italics, and for example matching from the start of the string:

^♪(?!\s*♪).*(?:\r.*?){0,2}♪
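As a rough check of that last pattern, here is a Python sketch that strips the lyric from the cue in the question. Assumptions: \n line breaks instead of \r, and re.MULTILINE so that ^ matches at the start of each line.

```python
import re

# Assumption: "\n" line breaks; MULTILINE lets ^ match at every line start.
pattern = re.compile(r"^♪(?!\s*♪).*(?:\n.*?){0,2}♪", re.MULTILINE)

cue = "906\n00:42:10,597 --> 00:42:12,530\n♪ When I woke up\nI opened my eyes♪\n\n907\n"

# Remove the lyric text; the cue number and timestamp are kept.
cleaned = pattern.sub("", cue)
print(repr(cleaned))
```

This leaves an empty cue behind (number and timestamp only), so a second pass to drop those is probably needed in a real workflow.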

Upvotes: 1
