Python RE. excluding some results

Question

I'm new to RE and I'm trying to take song lyrics and isolate the verse titles, the backing vocals, and main vocals:

Here's an example of some lyrics:

[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)...

The verse titles include the square brackets and any words between them. They can be successfully isolated with

r'\[{1}.*?\]{1}'

The backing vocals are similar to the verse titles, but between (). They've been successfully isolated with:

r'\({1}.*?\){1}'

For the main vocals, I've used

r'\S+'

which does isolate the main_vocals, but also the verse titles and backing vocals. I cannot figure out how to isolate only the main vocals with simple REs.
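
For reference, here is a minimal sketch that reproduces the problem on the sample lyrics above. The string literal is typed in by hand rather than read from a file, and the redundant {1} quantifiers are dropped, which should not change what the patterns match.

```python
import re

# Sample lyrics copied from the example above (assumed to be the whole input).
lyrics = (
    "[Intro]\n"
    "D.A. got that dope!\n"
    "\n"
    "[Chorus: Travis Scott]\n"
    "Ice water, turned Atlantic (Freeze)\n"
    "Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."
)

verse_titles = re.findall(r'\[.*?\]', lyrics)    # text in square brackets
backing_vocals = re.findall(r'\(.*?\)', lyrics)  # text in parentheses
tokens = re.findall(r'\S+', lyrics)              # every run of non-whitespace

print(verse_titles)
print(backing_vocals)
print(tokens)  # also contains '[Intro]', '(Freeze)', '(Skrrt,' and so on
```

The last list shows the issue: \S+ knows nothing about brackets, so pieces of the titles and backing vocals come back mixed in with the main vocals.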

Here's a Python script that produces the output I want, but I'd like to do it with REs alone (as a learning exercise) and cannot work out how from the documentation.

import re

file = 'D:/lyrics.txt'
with open(file, 'r') as f:
    lyrics = f.read()

def find_spans(pattern, string):
    pattern = re.compile(pattern)
    return [match.span() for match in pattern.finditer(string)]

verses = find_spans(r'\[{1}.*?\]{1}', lyrics)
backing_vocals = find_spans(r'\({1}.*?\){1}', lyrics)
main_vocals = find_spans(r'\S+', lyrics)

# Spans to exclude: verse titles plus backing vocals
# (a new list, so verses itself is not mutated).
exclude = verses + backing_vocals

# Every character index covered by an excluded span;
# a set keeps the membership test below fast.
not_main_vocals = set()
for start, stop in exclude:
    not_main_vocals.update(range(start, stop))

# Keep only the tokens that do not overlap any excluded index.
main_vocals = [
    (start, stop)
    for start, stop in main_vocals
    if not any(i in not_main_vocals for i in range(start, stop))
]

r.ook · Accepted Answer

Try this Demo:

pattern = r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(]+)'

You can use re.finditer to isolate the groups.

breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for p in re.finditer(pattern, song):
    for key, item in p.groupdict().items():
        if item: breakdown[key].append(item)

Result:

{
  'Verse': 
    [
      '[Intro]', 
      '[Chorus: Travis Scott]'
    ], 
  'Backing': 
    [
      '(Freeze)', 
      '(Skrrt, Skrrt)'
    ], 
  'Lyrics': 
    [
      '\nD.A. got that dope!\n\n', 
      '\nIce water, turned Atlantic ', 
      "\nNightcrawlin' in the Phantom ", 
      '...'
    ]
}
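
For anyone who wants to run this end to end, the following is a self-contained sketch. It assumes song holds exactly the sample lyrics from the question, and it uses match.lastgroup instead of looping over groupdict(); that is a small variation, not part of the original snippet, and it relies on exactly one named group matching per hit.

```python
import re

# Assumption: the sample lyrics from the question, with no trailing newline.
song = (
    "[Intro]\n"
    "D.A. got that dope!\n"
    "\n"
    "[Chorus: Travis Scott]\n"
    "Ice water, turned Atlantic (Freeze)\n"
    "Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."
)

pattern = re.compile(
    r'(?P<Verse>\[[^\]]+])'      # [ ... ] verse title
    r'|(?P<Backing>\([^\)]+\))'  # ( ... ) backing vocal
    r'|(?P<Lyrics>[^\[\(]+)'     # anything up to the next [ or (
)

breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for match in pattern.finditer(song):
    # Only one alternative matches at a time, so lastgroup is its name.
    breakdown[match.lastgroup].append(match.group())

print(breakdown)
```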

To elaborate a bit further on the pattern: it uses named groups to separate the three kinds of match. [^\]]+ means "one or more characters that are not ]" (and likewise [^\)]+ means one or more characters that are not )). In the Lyrics group we exclude the characters [ and (, so a Lyrics match stops wherever a verse title or backing vocal begins. The demo on regex101 explains the components in more detail if you need it.
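
As a rough illustration of how the negated character class differs from the lazy .*? used in the question, consider the sketch below. The wrapped and empty inputs are made-up edge cases, not taken from the lyrics.

```python
import re

lazy = re.compile(r'\[.*?\]')       # style used in the question
negated = re.compile(r'\[[^\]]+]')  # style used in this answer

one_line = "[Chorus: Travis Scott]"
wrapped = "[Chorus:\nTravis Scott]"  # hypothetical title split across lines
empty = "[]"                         # hypothetical empty title

# On ordinary single-line titles both styles agree.
same = lazy.findall(one_line) == negated.findall(one_line)

# '.' does not match a newline by default; [^\]] does.
lazy_wrapped = lazy.findall(wrapped)
negated_wrapped = negated.findall(wrapped)

# '+' requires at least one character; '.*?' accepts zero.
lazy_empty = lazy.findall(empty)
negated_empty = negated.findall(empty)

print(same, lazy_wrapped, negated_wrapped, lazy_empty, negated_empty)
```

For the sample lyrics either style gives the same titles; the difference only shows up on inputs like these.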

If you don't care for the newlines in the main lyrics, use (?P<Lyrics>[^\[\(\n]+) (which also excludes \n) to return your Lyrics without newlines:

'Lyrics': [
  'D.A. got that dope!', 
  'Ice water, turned Atlantic ',
  "Nightcrawlin' in the Phantom ", 
  '...'
]
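
A sketch of that variant, again assuming song is the sample from the question. The final strip() step is an extra assumption on my part, for the case where the trailing spaces before a backing vocal are unwanted.

```python
import re

# Assumption: the sample lyrics from the question, with no trailing newline.
song = (
    "[Intro]\n"
    "D.A. got that dope!\n"
    "\n"
    "[Chorus: Travis Scott]\n"
    "Ice water, turned Atlantic (Freeze)\n"
    "Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."
)

# Same pattern as before, but the Lyrics group also stops at newlines.
pattern = re.compile(
    r'(?P<Verse>\[[^\]]+])'
    r'|(?P<Backing>\([^\)]+\))'
    r'|(?P<Lyrics>[^\[\(\n]+)'
)

# Newlines match no alternative, so finditer simply skips over them.
lyrics_only = [
    m.group('Lyrics') for m in pattern.finditer(song) if m.group('Lyrics')
]

# Optional: drop the space left in front of each backing vocal.
stripped = [s.strip() for s in lyrics_only]

print(lyrics_only)
print(stripped)
```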
