sudormrfbin

Reputation: 746

Matching only "featured artist" in a set of filenames -- current regex too greedy

I'm writing a script in python to extract the name of the featured artist from an mp3 filename and set the appropriate id3v2 tag of the file. The filenames are in 3 different formats:

Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3

This is the regex that I wrote:

r'ft\. (.+)[.\-)]'

Then I can use re.findall to get the contents of the group. But this is what I get:

In [40]: r = r'ft\. (.+)[.\-)]'

In [47]: re.findall(r, 'Artist - Track ft. FeatArtist.mp3')
Out[47]: ['FeatArtist']

In [48]: re.findall(r, 'Artist ft. FeatArtist - Track.mp3')
Out[48]: ['FeatArtist - Track']

In [49]: re.findall(r, 'Artist - Track (ft. FeatArtist).mp3')
Out[49]: ['FeatArtist)']

My intended output is in all three cases precisely:

FeatArtist

The problem is that the regex matches as much as it can. I want it to stop as soon as it finds one of the characters in [.\-)]. How can I do this?

Upvotes: 0

Views: 173

Answers (2)

hmedia1

Reputation: 6180

For Python

For your specific requirement according to your filename formats:

re.findall(r'ft\.\s*(\w*)',filename)

Each of these filenames:

  • Artist - Track ft. FeatArtist.mp3
  • Artist ft. FeatArtist - Track.mp3
  • Artist - Track (ft. FeatArtist).mp3
    

Will return:

  • ['FeatArtist']
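
As a quick check, the simple pattern can be run against the three filename formats from the question. This is only a sketch; the names pattern, filenames and results are chosen here for illustration:

```python
import re

# Simple pattern: "ft.", optional whitespace, then a run of word characters.
pattern = re.compile(r'ft\.\s*(\w*)')

filenames = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]

# findall returns the contents of the single capture group for each match.
results = [pattern.findall(name) for name in filenames]

for name, found in zip(filenames, results):
    print(f'{name!r} -> {found}')  # each line ends with ['FeatArtist']
```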
    

If you want to account for a number of other possible scenarios:

In your provided examples, each FeatArtist is terminated by one of the following: a space followed by a -, a closing round bracket, or the file extension .mp3.

If we had any of the following:

  • Feat.Artist
  • Feat Artist
  • Feat Middlename Artist
  • Feat Artist One & Artist Two
    

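With names like these the simple pattern falls apart, because \w* stops at the first character that is not a letter, digit, or underscore. One way to tackle these variants might be: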

First get rid of the file extension without using string matching at all. Doing this with filenames gives you a cleaner starting point:

os.path.splitext('Artist - Track ft. FeatArtist.mp3')[0] gives you the filename without its extension: Artist - Track ft. FeatArtist
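
A minimal sketch of that step (the variable names are only for illustration):

```python
import os.path

filename = 'Artist - Track ft. FeatArtist.mp3'

# splitext splits at the last dot and returns a (stem, extension) pair.
stem, extension = os.path.splitext(filename)

print(stem)       # Artist - Track ft. FeatArtist
print(extension)  # .mp3
```

Note that splitext always splits at the last dot, so it should only be applied to names that still carry their extension. On a name that has already been stripped, the dot in "ft." would be treated as the start of the extension.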

We can accommodate the new filenames with this regex:

  • re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)', filename)
    

Tests (the results are listed afterwards in the same order, for easier reading):

>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track ft. FeatArtist')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. FeatArtist - Track')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. FeatArtist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist & Other Artist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat Artist & Other Artist - Track')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat.Artist & Crew - Track')

Results:

['FeatArtist']
['FeatArtist']
['FeatArtist']
['Feat Artist']
['Feat Artist & Other Artist']
['Feat Artist & Other Artist']
['Feat.Artist & Crew']
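
Putting the two steps together might look like the sketch below. The helper name featured_artist and the constant FEAT_PATTERN are hypothetical, and the function assumes the filename still has its extension and that "ft." appears at most once:

```python
import os.path
import re

# Same regex as above: capture lazily after "ft." until " -", ")" or end of string.
FEAT_PATTERN = re.compile(r'ft\.\s*(\w*.*?)(?= -|\)|$)')


def featured_artist(filename):
    """Return the featured artist from an mp3 filename, or None if there is no "ft." part."""
    # Drop the extension first so the name can safely run to the end of the string.
    stem, _extension = os.path.splitext(filename)
    match = FEAT_PATTERN.search(stem)
    return match.group(1) if match else None


print(featured_artist('Artist - Track ft. FeatArtist.mp3'))          # FeatArtist
print(featured_artist('Artist ft. Feat.Artist & Crew - Track.mp3'))  # Feat.Artist & Crew
print(featured_artist('Artist - Track.mp3'))                         # None
```

re.search is used here instead of re.findall because only the first occurrence is wanted, and it makes the "no featured artist" case easy to detect.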

Why no lookbehind?

From the Python documentation (formatting added):

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Therefore you can still use repetition operators to establish the match, and use groups to control which portion of the match is returned.
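
A small illustration of that behaviour, using one of the filenames from the question (the variable names are only for illustration):

```python
import re

text = 'Artist ft. FeatArtist - Track'

# No group in the pattern: findall returns the whole match, "ft. " included.
whole = re.findall(r'ft\.\s*\w*', text)

# One group in the pattern: findall returns only what the group captured.
grouped = re.findall(r'ft\.\s*(\w*)', text)

print(whole)    # ['ft. FeatArtist']
print(grouped)  # ['FeatArtist']
```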


Other ways to do something similar:

If you are using a regex engine that supports \K (which drops everything matched so far from the reported match), then the match is everything after the \K:

Examples using grep with -P (Perl-compatible regex) and -o (print only the matched part):

echo "Artist - Track ft. FeatArtist" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist ft. FeatArtist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist - Track (ft. FeatArtist)" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist ft. Feat Artist & Other Artist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
Feat Artist & Other Artist

Upvotes: 1

Xyzk

Reputation: 1332

This should work:

(?<=ft\. )[^\-)\. ]+

(?<=ft\. ) is a lookbehind: the match must be preceded by "ft. ".

[^\-)\. ]+ matches the name itself: one or more characters that are not a dash, closing bracket, dot, or space.
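
A short sketch of how this pattern behaves with Python's re module (the variable names are only for illustration). Because the character class excludes spaces, a multi-word name is cut at the first space:

```python
import re

# Lookbehind for "ft. ", then a run of characters that are not -, ), . or space.
pattern = re.compile(r'(?<=ft\. )[^\-)\. ]+')

filenames = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]

# No capture group here, so findall returns the whole match.
results = [pattern.findall(name) for name in filenames]
print(results)  # [['FeatArtist'], ['FeatArtist'], ['FeatArtist']]

# Limitation: the name stops at the first space.
truncated = pattern.findall('Artist ft. Feat Artist - Track.mp3')
print(truncated)  # ['Feat']
```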

Upvotes: 0
