Osuynonma

Reputation: 499

Python RE: excluding some results

I'm new to regular expressions (RE) and I'm trying to split song lyrics into the verse titles, the backing vocals, and the main vocals.

Here's an example of some lyrics:

[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)...

The verse titles include the square brackets and any words between them. They can be successfully isolated with

r'\[{1}.*?\]{1}'

The backing vocals are similar to the verse titles, but between (). They've been successfully isolated with:

r'\({1}.*?\){1}'

For the main vocals, I've used

r'\S+'

which does match the main vocals, but also matches the verse titles and backing vocals. I cannot figure out how to isolate only the main vocals with simple REs.

Here's a Python script that gets the output I want, but I'd like to do it with REs alone (as a learning exercise) and cannot work out how from the documentation.

import re

file = 'D:/lyrics.txt'
with open(file, 'r') as f:
    lyrics = f.read()

def find_spans(pattern, string):
    pattern = re.compile(pattern)
    return [match.span() for match in pattern.finditer(string)]

verses = find_spans(r'\[{1}.*?\]{1}', lyrics)
backing_vocals = find_spans(r'\({1}.*?\){1}', lyrics)
main_vocals = find_spans(r'\S+', lyrics)

# Collect every character index covered by a verse title or backing vocal.
exclude = verses + backing_vocals  # new list, so `verses` is not mutated
not_main_vocals = set()
for start, stop in exclude:
    not_main_vocals.update(range(start, stop))

# Keep only the \S+ spans that do not overlap any excluded index.
main_vocals = [
    (start, stop)
    for start, stop in main_vocals
    if not any(i in not_main_vocals for i in range(start, stop))
]

Upvotes: 3

Views: 179

Answers (2)

r.ook

Reputation: 13878

Try this Demo:

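import re

# Compile the pattern so that pattern.finditer() is available below.
pattern = re.compile(r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(]+)')

You can use finditer on the compiled pattern to isolate the groups (song is the string holding the lyrics):

breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for p in pattern.finditer(song):
    # Only the group that actually matched has a non-empty value.
    for key, item in p.groupdict().items():
        if item:
            breakdown[key].append(item)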

Result:

{
  'Verse': 
    [
      '[Intro]', 
      '[Chorus: Travis Scott]'
    ], 
  'Backing': 
    [
      '(Freeze)', 
      '(Skrrt, Skrrt)'
    ], 
  'Lyrics': 
    [
      '\nD.A. got that dope!\n\n', 
      '\nIce water, turned Atlantic ', 
      "\nNightcrawlin' in the Phantom ", 
      '...'
    ]
}

To elaborate on the pattern: it uses named groups to separate the three kinds of text. [^\]]+ matches one or more characters that are not ], so \[[^\]]+] matches a whole bracketed title; likewise [^\)]+ matches one or more characters that are not ). The Lyrics group, [^\[\(]+, matches any run of characters that contains neither [ nor (. The demo on regex101 explains the components in more detail if you need it.
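As a self-contained sketch of the same idea, the version below compiles the pattern and reads the name of the matching group from match.lastgroup instead of looping over groupdict(). The helper name classify and the variable lyrics are illustrative, not part of the original answer.

```python
import re

lyrics = """[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""

pattern = re.compile(
    r'(?P<Verse>\[[^\]]+\])'     # [ ... ] verse titles
    r'|(?P<Backing>\([^)]+\))'   # ( ... ) backing vocals
    r'|(?P<Lyrics>[^\[(]+)'      # any run without [ or (
)

def classify(text):
    """Return a dict mapping each group name to the list of its matches."""
    breakdown = {name: [] for name in pattern.groupindex}
    for match in pattern.finditer(text):
        # Exactly one alternative matches each time; lastgroup is its name.
        breakdown[match.lastgroup].append(match.group())
    return breakdown

result = classify(lyrics)
for name, items in result.items():
    print(name, items)
```

Because the three alternatives are not nested, lastgroup always names the one that matched, which avoids testing every group for emptiness.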

If you don't want the newlines in the main lyrics, use (?P<Lyrics>[^\[\(\n]+), which also excludes \n, to get the Lyrics without newlines:

'Lyrics': [
  'D.A. got that dope!', 
  'Ice water, turned Atlantic ',
  "Nightcrawlin' in the Phantom ", 
  '...'
]
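If word-level spans are wanted, as in the question's script, one possible variation is to let the first two alternatives consume the bracketed text without capturing it and to capture only free-standing words. This is a sketch rather than a drop-in replacement: the names token, main_vocal_spans and main_vocal_words are made up for illustration, and unlike the question's script it keeps the trailing "..." that directly follows "(Skrrt, Skrrt)".

```python
import re

lyrics = """[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""

# Alternatives are tried left to right: [...] and (...) chunks are matched
# but not captured; only words outside them end up in group 1.
token = re.compile(r'\[.*?\]|\(.*?\)|([^\s\[(]+)')

# group(1) is None whenever one of the bracketed alternatives matched.
main_vocal_spans = [m.span(1) for m in token.finditer(lyrics) if m.group(1)]
main_vocal_words = [lyrics[start:stop] for start, stop in main_vocal_spans]

print(main_vocal_words)
```

The word alternative uses [^\s\[(]+ rather than \S+ so that a word stops before an opening bracket instead of swallowing it.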

Upvotes: 1

Joe Teague

Reputation: 131

You could search for the text between close-brackets and open-brackets, using regex groups. If you have a single group (sub-pattern inside round-brackets) in your regex, re.findall will just return the contents of those brackets.

For example, "\[(.*?)\]" would find you just the section labels, not including the square brackets (since they're outside the group).
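A minimal illustration of that re.findall behaviour, using the sample lyrics from the question (the variable names are only for this sketch):

```python
import re

lyrics = """[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""

# No group: findall returns the whole match, brackets included.
with_brackets = re.findall(r"\[.*?\]", lyrics)

# One group: findall returns only what the group captured.
labels_only = re.findall(r"\[(.*?)\]", lyrics)

print(with_brackets)
print(labels_only)
```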

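The regex "\)(.*?)\(" would find just the text between the two backing vocals ("\nNightcrawlin' in the Phantom "), provided the DOTALL option discussed below is set, since the captured text starts with a newline.
Similarly, we could find the first line with "\](.*?)\[".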

Combining the two types of brackets into a character class, the (significantly messier looking) regex "[\]\)](.*?)[\[\(]" captures all of the lyrics.

It will miss lines that don't have brackets both before and after them (i.e. at the very start, before [Intro], if there are any, or at the end if no backing vocals follow). A possible workaround is to prepend a "]" character and append a "[" character to the lyrics, so that a match can start at the beginning of the string and end at its end. Note that we need the DOTALL option so that the wildcard "." also matches the newline character "\n".

import re

lyrics = """[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""


matches = re.findall(r"[\]\)](.*?)[\[\(]", "]" + lyrics + "[", re.DOTALL)
main_vocals = '\n'.join(matches)
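A different technique that may be worth comparing is re.split, which treats the bracketed chunks as delimiters and so needs neither the added "]" and "[" characters nor DOTALL. This is a sketch of an alternative, not what the code above does; the name pieces is illustrative.

```python
import re

lyrics = """[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""

# Split on every [...] or (...) chunk. The pattern has no capturing group,
# so the delimiters themselves are dropped from the result.
pieces = re.split(r"\[.*?\]|\(.*?\)", lyrics)

# Trim surrounding whitespace and discard pieces that are empty afterwards.
main_vocals = [p.strip() for p in pieces if p.strip()]

print(main_vocals)
```

The pattern only has to describe the delimiters, so text before the first bracket or after the last one is returned like any other piece.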

Upvotes: 1
