Reputation: 2970

Python regex - extracting directories from path

I have a question about regex/Python. Sorry if this topic has been discussed millions of times - usually I find the answers on so/google etc. but I'm stuck in the millions of answers with this one.. (To be honest - I own a regex book, but somehow I'm too stupid to really understand it...)

For a music management system I need to extract information from paths, offering different sets of options. Here are two examples:

If the path is: (Case 1)

"/The Prodigy/The Fat Of The Land/04 - Funky Stuff.flac"

it should extract:

artist: "The Prodigy"
release: "The Fat Of The Land"
Tracknumber: 4
Title: "Funky Stuff"

And, for example: (Case 2)

"/[XLR 483] The Fat Of The Land/04 - The Prodigy - The  Funky Stuff.flac"

should extract:

catno: "XLR 483"
release: "The Fat Of The Land"
Tracknumber: 4
artist: "The Prodigy"
Title: "Funky Stuff"

There is no need for a regex that covers both cases, these are just two examples. I'll then provide them as options (or starting-point to add own ones).

Any help would be greatly appreciated!

@ S.Lott: I don't have a regex for this, I started with splitting the string:

parts = rel_path.split('/')       
track = parts[-1]
release = parts[-2]
artist = parts[-3]

but this looks like an extremely inflexible and inelegant solution to me.

edit:

So far I have something like:

pattern = re.compile('^/(?P<artist>[a-zA-Z0-9 ]+)/(?P<release>[a-zA-Z0-9 ]+)/(?P<track>[a-zA-Z0-9 -_]+).[a-zA-Z]*.*')


rel_path = '/The Prodigy/The Fat Of The Land/04 - Funky Stuff.flac'

match = pattern.search(rel_path)

artist = match.group('artist')
release = match.group('release')
track = match.group('track')

Upvotes: 3

Answers (4)

Mike

Reputation: 20196

import re

# Case 1: /artist/release/NN - title.ext
pattern1 = re.compile(r'/([^/]*)/([^/]*)/([0-9]*) - (.*)\.[^.]*')
artist, release, Tracknumber, Title = pattern1.match(file1).groups()

# Case 2: /[catno] release/NN - artist - title.ext
pattern2 = re.compile(r'/\[([^]]*)\] ([^/]*)/([0-9]*) - (.*) - (.*)\.[^.]*')
catno, release, Tracknumber, artist, Title = pattern2.match(file2).groups()

(where file1 and file2 are the paths you gave above).

First thing: you capture part of a match by wrapping that part of the regex in parentheses. Everything matched inside a pair of parentheses is returned as one item of the match's groups.

Second: you match anything except a forward slash with regex code like [^/]. So to match lots of things between forward slashes, you do [^/]*.

Putting those together, to capture the artist in your first string, you do /([^/]*)/. Then you do that again to get the release.

Third: to match any digit, you use [0-9]. So, to match any run of digits, you use [0-9]* (or [0-9]+ if at least one digit is required).

Apply those principles repeatedly, and you should be able to understand the above.
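As a minimal sketch of those principles applied to both cases, the same patterns can be written with named groups so that the result comes back as a dictionary. The names CASE1, CASE2 and parse_path are illustrative assumptions, not part of the answer above; Python 3 is assumed, and the Case 2 path is a simplified variant of the one in the question.

```python
import re

# Case 1: /artist/release/NN - title.ext
CASE1 = re.compile(
    r'/(?P<artist>[^/]*)/(?P<release>[^/]*)/'
    r'(?P<tracknumber>[0-9]+) - (?P<title>.*)\.[^.]*'
)

# Case 2: /[catno] release/NN - artist - title.ext
CASE2 = re.compile(
    r'/\[(?P<catno>[^\]]*)\] (?P<release>[^/]*)/'
    r'(?P<tracknumber>[0-9]+) - (?P<artist>.*) - (?P<title>.*)\.[^.]*'
)


def parse_path(pattern, path):
    """Return the named groups as a dict, or None if the path does not match."""
    match = pattern.fullmatch(path)
    if match is None:
        return None
    fields = match.groupdict()
    fields['tracknumber'] = int(fields['tracknumber'])  # '04' -> 4
    return fields


case1 = parse_path(CASE1, '/The Prodigy/The Fat Of The Land/04 - Funky Stuff.flac')
# Simplified variant of the Case 2 path from the question.
case2 = parse_path(CASE2, '/[XLR 483] The Fat Of The Land/04 - The Prodigy - Funky Stuff.flac')
```

Returning None for a non-matching path makes it easy to try several patterns in turn and keep the first one that matches.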

Upvotes: 3

Will Cheng

Reputation: 199

Although it is not strictly necessary, re is a handy choice for this problem.

import re
# The dot before "flac" is escaped so that it matches a literal '.'
pattern = re.compile(r"/(?P<artist>[a-zA-Z0-9 ]+?)/(?P<release>[a-zA-Z0-9 ]+?)/(?P<tracknumber>\d+?) - (?P<title>[a-zA-Z0-9 ]+?)\.flac")
s = "/The Prodigy/The Fat Of The Land/04 - Funky Stuff.flac"
m = pattern.search(s)
print(m.group('artist'))
print(m.group('release'))
print(m.group('tracknumber'))  # the group name must match the one in the pattern
print(m.group('title'))

I use expressions such as [a-zA-Z0-9 ] to explicitly specify the characters I expect in the string. It is just my preference to have a whitelist-style regex to make the code more secure. There are many other ways to compose equivalent patterns. You will find all you need at http://docs.python.org/library/re.html; you don't need a book for that.
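One consequence of the whitelist approach is that search returns None for any path containing a character outside the list, so the result should be checked before calling group. A minimal sketch, assuming Python 3; the helper name describe and the second path are made up for illustration:

```python
import re

# Same whitelist idea as above, with the dot before the extension escaped.
pattern = re.compile(
    r"/(?P<artist>[a-zA-Z0-9 ]+?)/(?P<release>[a-zA-Z0-9 ]+?)/"
    r"(?P<tracknumber>\d+?) - (?P<title>[a-zA-Z0-9 ]+?)\.flac"
)


def describe(path):
    """Return the captured fields, or None when the whitelist rejects the path."""
    m = pattern.search(path)
    return m.groupdict() if m else None


ok = describe("/The Prodigy/The Fat Of The Land/04 - Funky Stuff.flac")
# "AC-DC" contains a hyphen, which is outside the [a-zA-Z0-9 ] whitelist.
rejected = describe("/AC-DC/Back In Black/01 - Hells Bells.flac")
```

Whether rejecting such paths is desirable depends on the collection; widening the character class is the usual adjustment.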

Upvotes: 6

Aif

Reputation: 11220

You should first use split with the / delimiter, so that the length of the list returned by split already tells you which case you are dealing with.

Then you can use a regexp if you need one. For instance, in the second case (which happens only if you have two elements, right?):

import re
item = "/[XLR 483] The Fat Of The Land/04 - The Prodigy - The  Funky Stuff.flac"
# Raw string, so the backslashes reach the regex engine unchanged
matches = re.search(r'^\/?\[([^\]]+)](.*)\/', item)
print(matches.group(1)) # 'XLR 483'
print(matches.group(2)) # ' The Fat Of The Land'

It may seem a bit complicated, but I have escaped all ambiguous characters, so basically the pattern reads as follows:

^ at the beginning
/? an optional slash (zero or one /), followed by...
[ an opening square bracket
([^\]]+) containing anything but a closing square bracket, one or more times (+), captured by the grouping parentheses, and
] a closing square bracket, followed by
(.*) anything but a linefeed, zero or more times (*), also captured by the parentheses,
and a trailing slash /.
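Combining the two steps, splitting first and then applying a regexp to the directory component only, could look roughly like this. It is a sketch assuming Python 3; the variable names and the \s* that drops the leading space from the release name are additions of this example, and the path is a simplified variant of Case 2.

```python
import re

item = "/[XLR 483] The Fat Of The Land/04 - The Prodigy - Funky Stuff.flac"

# Drop the empty string that precedes the leading slash.
parts = [p for p in item.split('/') if p]

if len(parts) == 2:
    # Two components: the "[catno] release" directory and the file name.
    directory, filename = parts
    # \s* after the bracket keeps the leading space out of the release name.
    m = re.match(r'\[([^\]]+)\]\s*(.*)', directory)
    catno, release = m.group(1), m.group(2)
```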

hope this helps!

Upvotes: 0

Senthil Kumaran

Reputation: 56841

Here is my approach to the problem that you are having.

Split the path on '/' and check whether the result has length 4 (first case) or 3 (second case).
Ignore the first element, which is an empty string because the path starts with '/'. In the second case, process the second element to extract [xxx].
Split the last element on '-' to get the remaining information.
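The steps above can be sketched without any regex at all. This is only one possible reading of them, assuming Python 3: the function name parse_by_split is hypothetical, and the file name is split on ' - ' (with spaces) rather than a bare '-' so that hyphens inside names survive.

```python
def parse_by_split(path):
    """Split-only sketch: 4 parts -> Case 1, 3 parts -> Case 2, else None."""
    parts = path.split('/')
    # parts[0] is the empty string before the leading '/'.
    name = parts[-1].rsplit('.', 1)[0]  # drop the file extension
    fields = [f.strip() for f in name.split(' - ')]
    if len(parts) == 4 and len(fields) == 2:
        return {'artist': parts[1], 'release': parts[2],
                'tracknumber': int(fields[0]), 'title': fields[1]}
    if len(parts) == 3 and len(fields) == 3:
        # "[XLR 483] Release" -> ("[XLR 483", "]", " Release")
        catno, _, release = parts[1].partition(']')
        return {'catno': catno.lstrip('['), 'release': release.strip(),
                'tracknumber': int(fields[0]), 'artist': fields[1],
                'title': fields[2]}
    return None
```

For example, parse_by_split('/The Prodigy/The Fat Of The Land/04 - Funky Stuff.flac') yields the artist, release, track number 4 and title of Case 1.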

If you have any specific doubts about writing the regex, edit your question and follow S.Lott's suggestion.