Matching only "featured artist" in a set of filenames -- current regex too greedy

Question

I'm writing a script in Python to extract the name of the featured artist from an mp3 filename and set the appropriate ID3v2 tag of the file. The filenames come in 3 different formats:

Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3

This is the regex that I wrote:

r'ft\. (.+)[.\-)]'

Then I can use re.findall to get the contents of the group. But this is what I get:

In [40]: r = r'ft\. (.+)[.\-)]'

In [47]: re.findall(r, 'Artist - Track ft. FeatArtist.mp3')
Out[47]: ['FeatArtist']

In [48]: re.findall(r, 'Artist ft. FeatArtist - Track.mp3')
Out[48]: ['FeatArtist - Track']

In [49]: re.findall(r, 'Artist - Track (ft. FeatArtist).mp3')
Out[49]: ['FeatArtist)']

My intended output is in all three cases precisely:

FeatArtist

The problem is that the regex matches as much as it can. I want it to stop as soon as it finds one of the characters in [.\-)]. How can I do this?

hmedia1 · Accepted Answer

For python

For your specific requirement according to your filename formats:

re.findall(r'ft\.\s*(\w*)', filename)

Each of these filenames:

Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3

Will return:

```
['FeatArtist']
```
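For comparison, here is a minimal sketch of why the pattern in the question over-matches, and what happens if you only make its quantifier lazy. The helper name first_group and the constant names are hypothetical, chosen just for this illustration:

```python
import re

GREEDY = re.compile(r'ft\. (.+)[.\-)]')   # pattern from the question
LAZY = re.compile(r'ft\. (.+?)[.\-)]')    # same pattern, lazy quantifier
SIMPLE = re.compile(r'ft\.\s*(\w*)')      # pattern suggested above

FILENAMES = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]


def first_group(pattern, name):
    """Return group 1 of the first match, or None when nothing matches."""
    match = pattern.search(name)
    return match.group(1) if match else None


for name in FILENAMES:
    # repr() makes stray whitespace in the captured text visible.
    print(repr(first_group(GREEDY, name)),
          repr(first_group(LAZY, name)),
          repr(first_group(SIMPLE, name)))
```

The greedy (.+) runs to the end of the string and backtracks only as far as the last character in [.\-)], which is the dot before mp3. The lazy (.+?) stops at the first such character, but in the second filename it still captures the space before the -, so you would need to call .strip() on the result. The \w* version avoids both problems for these three formats.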

If you want to account for a number of other possible scenarios:

In your provided examples, each FeatArtist is terminated by one of the following: a space followed by a -, a closing round bracket, or the file extension .mp3

If we had any of the following:

Feat.Artist
Feat Artist
Feat Middlename Artist
Feat Artist One & Artist Two

Things might fall apart, because \w* stops at the first non-word character (here a space, a . or an &). One way to tackle the above variants might be:

First get rid of the file extension without using string matching at all. Doing this with filenames gives you a cleaner starting point:

Using os.path.splitext('Artist - Track ft. FeatArtist.mp3')[0] you can get your filenames in this format: Artist - Track ft. FeatArtist
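A minimal sketch of that step, assuming every filename still carries its extension when it is passed in. The function name strip_extension is hypothetical:

```python
import os.path


def strip_extension(filename):
    """Return the filename without its final extension."""
    root, _extension = os.path.splitext(filename)
    return root


print(strip_extension('Artist - Track ft. FeatArtist.mp3'))
# Artist - Track ft. FeatArtist
```

One thing to watch: os.path.splitext splits at the last dot, so calling it on a name that has already lost its extension (for example 'Artist ft. FeatArtist - Track') would cut at the dot in "ft." instead. Apply it exactly once per filename.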

We can accommodate the new filenames with this regex:

re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)', filename)

Unit tests (the results are listed in the same order below, for easier reading):

>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track ft. FeatArtist')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. FeatArtist - Track')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. FeatArtist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist & Other Artist)')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat Artist & Other Artist - Track')
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat.Artist & Crew - Track')

Results:

['FeatArtist']
['FeatArtist']
['FeatArtist']
['Feat Artist']
['Feat Artist & Other Artist']
['Feat Artist & Other Artist']
['Feat.Artist & Crew']
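Putting the two steps together, a sketch of a helper that takes the raw filename, drops the extension, and returns the featured artist. The names featured_artist and FEAT_PATTERN are hypothetical, and returning None for filenames without "ft." is an assumption about what your script wants:

```python
import os.path
import re

# Capture lazily after "ft." and stop just before " -", ")" or the end of
# the string. The lookahead checks the terminator without consuming it.
FEAT_PATTERN = re.compile(r'ft\.\s*(\w*.*?)(?= -|\)|$)')


def featured_artist(filename):
    """Return the featured artist named in filename, or None if absent."""
    stem = os.path.splitext(filename)[0]   # strip ".mp3" first
    match = FEAT_PATTERN.search(stem)
    return match.group(1) if match else None


for name in [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. Feat Artist & Other Artist - Track.mp3',
    'Artist - Track (ft. Feat.Artist & Crew).mp3',
    'Artist - Track.mp3',
]:
    print(featured_artist(name))
```

Using re.search instead of re.findall gives you a single string (or None) rather than a list, which is usually more convenient when each filename has at most one featured-artist section.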

Why no lookbehind?

From the Python documentation (formatting added):

re.findall(pattern, string, flags=0) Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Therefore you can still use repetition operators to establish the match, and use groups to control the portion of the match that is returned.
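To illustrate, here is a small sketch of how the number of groups changes what re.findall returns. The variable names are only for illustration. The last lines show a lookbehind version for comparison; Python's re module only accepts fixed-width lookbehinds, so it has to use a literal space rather than \s*:

```python
import re

name = 'Artist ft. FeatArtist - Track'

# No group: the whole match is returned, including the "ft. " prefix.
whole_match = re.findall(r'ft\.\s*\w*', name)

# One group: only the group's text is returned.
one_group = re.findall(r'ft\.\s*(\w*)', name)

# Two groups: each match becomes a tuple of the group texts.
two_groups = re.findall(r'(ft)\.\s*(\w*)', name)

# Fixed-width lookbehind: "ft. " must precede the match but is not part of it.
lookbehind = re.findall(r'(?<=ft\. )\w+', name)

print(whole_match)   # ['ft. FeatArtist']
print(one_group)     # ['FeatArtist']
print(two_groups)    # [('ft', 'FeatArtist')]
print(lookbehind)    # ['FeatArtist']
```

The group version is more forgiving here, because \s* inside a lookbehind is variable-width and is rejected by re when the pattern is compiled.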

Other ways to do something similar:

If you are using a regex engine that supports \K (which resets the start of the reported match, rather than being a back reference), then the match is everything after the \K:

Examples using grep with -P (Perl Regex) and -o (Only return match):

echo "Artist - Track ft. FeatArtist" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist ft. FeatArtist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist - Track (ft. FeatArtist)" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist

echo "Artist ft. Feat Artist & Other Artist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
Feat Artist & Other Artist
