Reputation: 746
I'm writing a script in Python to extract the name of the featured artist from an MP3 filename and set the appropriate ID3v2 tag of the file. The filenames come in 3 different formats:
Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3
This is the regex that I wrote:
r'ft\. (.+)[.\-)]'
Then I can use re.findall to get the contents of the group. But this is what I get:
In [40]: r = r'ft\. (.+)[.\-)]'
In [47]: re.findall(r, 'Artist - Track ft. FeatArtist.mp3')
Out[47]: ['FeatArtist']
In [48]: re.findall(r, 'Artist ft. FeatArtist - Track.mp3')
Out[48]: ['FeatArtist - Track']
In [49]: re.findall(r, 'Artist - Track (ft. FeatArtist).mp3')
Out[49]: ['FeatArtist)']
My intended output is in all three cases precisely:
FeatArtist
The problem is that the regex is matching as much as it can. I want it to stop as soon as it finds one of the characters in [.\-)]. How can I do this?
Upvotes: 0
Views: 173
Reputation: 6180
re.findall(r'ft\.\s*(\w*)',filename)
Each of these filenames:
Artist - Track ft. FeatArtist.mp3
Artist ft. FeatArtist - Track.mp3
Artist - Track (ft. FeatArtist).mp3
Will return:
['FeatArtist']
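As a quick check of the claim above, here is a minimal sketch that runs the same pattern over the three filenames from the question. The names FEAT_WORD, filenames and results are only illustrative. The match stops where it does because \w cannot match a space, a dot, a hyphen or a bracket.

```python
import re

# Pattern from the answer: literal "ft.", optional whitespace,
# then a run of word characters captured in group 1.
FEAT_WORD = re.compile(r'ft\.\s*(\w*)')

filenames = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]

# findall returns only the captured group because the pattern has one group.
results = [FEAT_WORD.findall(name) for name in filenames]

for name, found in zip(filenames, results):
    print(name, '->', found)
```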
In your provided examples, each FeatArtist terminates with one of the following: a space followed by a -, a closing round bracket, or the file extension .mp3
If we had any of the following:
Feat.Artist
Feat Artist
Feat Middlename Artist
Feat Artist One & Artist Two
Things might fall apart. One way to tackle the above variants might be:
First get rid of the file extension without using string matching at all. Doing this with filenames gives you a cleaner starting point.
Using os.path.splitext('Artist - Track ft. FeatArtist.mp3')[0] you can get your files in this format: Artist - Track ft. FeatArtist
Then use a pattern whose group stops at a space followed by a -, at a closing bracket, or at the end of the string:
re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)', filename)
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track ft. FeatArtist')
['FeatArtist']
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. FeatArtist - Track')
['FeatArtist']
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. FeatArtist)')
['FeatArtist']
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist)')
['Feat Artist']
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist - Track (ft. Feat Artist & Other Artist)')
['Feat Artist & Other Artist']
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat Artist & Other Artist - Track')
['Feat Artist & Other Artist']
>>> re.findall(r'ft\.\s*(\w*.*?)(?= -|\)|$)','Artist ft. Feat.Artist & Crew - Track')
['Feat.Artist & Crew']
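The two steps (strip the extension, then match) could be combined into one helper. This is only a sketch: featured_artist and FEAT_RE are hypothetical names, the pattern is copied unchanged from the examples above, and it assumes every filename really ends in an extension (os.path.splitext splits at the last dot, so an extension-less name containing "ft." would be cut in the wrong place).

```python
import os
import re

# Same pattern as in the examples: lazy group, then a lookahead for
# " -", a closing bracket, or the end of the string.
FEAT_RE = re.compile(r'ft\.\s*(\w*.*?)(?= -|\)|$)')

def featured_artist(filename):
    """Return the featured artist found in filename, or None if there is no 'ft.'."""
    # Drop the extension first so the trailing ".mp3" cannot interfere.
    stem = os.path.splitext(filename)[0]
    match = FEAT_RE.search(stem)
    return match.group(1) if match else None

for name in ('Artist - Track ft. FeatArtist.mp3',
             'Artist ft. Feat Artist & Other Artist - Track.mp3',
             'Artist - Track.mp3'):
    print(name, '->', featured_artist(name))
```

Returning None for names without "ft." lets the caller skip the tag update instead of writing an empty value.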
From the Python documentation (formatting added):
re.findall(pattern, string, flags=0) Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Therefore you can still use repetition operators to establish the match, and use groups to control the portion of the match returned.
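To illustrate the quoted behaviour, here is a small sketch of how the number of groups changes what re.findall returns. The variable names and the sample string are only for illustration.

```python
import re

text = 'Artist ft. FeatArtist - Track'

# No group: the whole match is returned.
no_group = re.findall(r'ft\.\s*\w+', text)

# One group: only the group's contents are returned.
one_group = re.findall(r'ft\.\s*(\w+)', text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(ft\.)\s*(\w+)', text)

print(no_group)
print(one_group)
print(two_groups)
```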
If using a regex engine that supports \K (which discards everything matched so far from the reported match, and is not a back reference), then the match would be everything after the \K:
Examples using grep with -P (Perl regex) and -o (only return the match):
echo "Artist - Track ft. FeatArtist" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist
echo "Artist ft. FeatArtist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist
echo "Artist - Track (ft. FeatArtist)" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
FeatArtist
echo "Artist ft. Feat Artist & Other Artist - Track" | grep -oP "ft\.\s*\K(\w*.*?)(?= -|\)|$)"
Feat Artist & Other Artist
Upvotes: 1
Reputation: 1332
This should work:
(?<=ft\. )[^\-)\. ]+
(?<=ft\. ) is a lookbehind: the match must be preceded by ft. and a space, which are not part of the result.
[^\-)\. ]+ means the match itself has to be a single word, without spaces/dashes/brackets/dots.
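A minimal sketch of this pattern with Python's re module, which accepts it because the lookbehind has a fixed width. The names LOOKBEHIND, filenames and matches are illustrative only. Since the character class excludes spaces, a multi-word featured artist is cut off after the first word, as the last line shows.

```python
import re

# Lookbehind for "ft. ", then one or more characters that are not
# a dash, closing bracket, dot or space.
LOOKBEHIND = re.compile(r'(?<=ft\. )[^\-)\. ]+')

filenames = [
    'Artist - Track ft. FeatArtist.mp3',
    'Artist ft. FeatArtist - Track.mp3',
    'Artist - Track (ft. FeatArtist).mp3',
]

matches = [LOOKBEHIND.findall(name) for name in filenames]
print(matches)

# Limitation: only the first word of a multi-word name is matched.
first_word_only = LOOKBEHIND.findall('Artist - Track (ft. Feat Artist).mp3')
print(first_word_only)
```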
Upvotes: 0