Reputation: 33
First off, I am new to regex, but so far I am in love with it. I am using a regex to extract info from an image file's name that I get from a render engine. So far this regex is working decently...
_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$
If I use the split() method on a file name such as...
image_file_name_ao.0001.exr
I get back a nice little list I can use...
['image_file_name', 'ao', None, '.', '0001', 'exr', '']
My only concern is that it always returns an empty string last. No matter how I change or manipulate the regex it always gives me an empty string at the end of the list. I am totally comfortable with ignoring it and moving on, but my question is am I doing something wrong with my regex or is there something I can do to make it not pass that final empty string? Thank you for your time.
Upvotes: 3
Views: 1986
Reputation: 2136
You can use filter().
Given your example, it would work like this:
import re

def f(x):
    return x != ''

# list() is needed on Python 3, where filter() returns an iterator
list(filter(
    f,
    re.split(r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$',
             'image_file_name_ao.0001.exr')
))
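The same filtering can also be written as a list comprehension. This is only a minimal sketch, assuming Python 3; the names pattern, parts and cleaned are illustrative. Note that it removes empty strings only, so a None from an optional group that did not match stays in place.

```python
import re

pattern = r'_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
parts = re.split(pattern, 'image_file_name_ao.0001.exr')

# Drop empty strings only; None (an optional group that did not match)
# is kept so the positions of the remaining fields stay stable.
cleaned = [p for p in parts if p != '']
print(cleaned)
```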
Upvotes: 1
Reputation: 1953
No wonder. The split method splits your string at occurrences of the regex (and also returns the captured groups). Since your regex matches only substrings that reach the end of the line (indicated by the $ at its end), there is nothing left to split off at the file name's end but an empty suffix ('').
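To see this effect in isolation, here is a small sketch. The strings and patterns are made up for illustration and are not taken from the question.

```python
import re

# A match that ends exactly at the end of the string leaves an empty suffix.
at_end = re.split(r'-(\d+)$', 'frame-0001')
print(at_end)

# A match in the middle of the string leaves a non-empty remainder instead.
in_middle = re.split(r'-(\d+)', 'frame-0001-final')
print(in_middle)
```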
Given that you are already using groups "(...)" in your expression, you could just as well use re.match(regex, string). This will give you a MatchObject instance, from which you can retrieve a tuple containing your groups via groups():
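import re
filename = 'image_file_name_ao.0001.exr'
# additional group up front, so the pattern matches from the first character
reg = r'(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$'
print(re.match(reg, filename).groups())  # request tuple of group matches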
Edit: I'm really sorry, but I didn't realize that your pattern does not match the file name string from its first character on. I extended it in my answer. If you want to stick with your approach using split(), you might also change your original pattern so that the last part of the file name is not matched and hence gets split off.
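One caveat with the match() approach: re.match returns None when the name does not fit the pattern, so calling groups() directly would raise an AttributeError. Here is a minimal sketch of a guard; the helper name parse_render_name and the second test string are hypothetical.

```python
import re

REG = re.compile(r'(\S*)_([a-z]{2,8})_?(\d{1,2})?(\.|_)(\d{3,10})\.([a-z]{2,6})$')

def parse_render_name(filename):
    """Return the tuple of groups, or None if the name does not fit the pattern."""
    m = REG.match(filename)
    return m.groups() if m else None

print(parse_render_name('image_file_name_ao.0001.exr'))
print(parse_render_name('notes.txt'))
```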
Upvotes: 3
Reputation: 27575
Interesting question.
I changed the regex pattern a little, so that the extension is only checked by a lookahead and not consumed by the match:
import re

reg = re.compile(r'_([a-z]{2,8})'
                 r'_?(\d\d?)?'
                 r'([._])'
                 r'(\d{3,10})'
                 r'\.'
                 r'(?=[a-z]{2,6}$)')  # lookahead: extension must follow, but is not consumed

for ss in ('image_file_name_ao.0001.exr',
           'image_file_name_45_ao.0001.exr',
           'image_file_name_ao_78.0001.exr',
           'image_file_name_ao78.0001.exr'):
    print('%s\n%r\n' % (ss, reg.split(ss)))
Result:
image_file_name_ao.0001.exr
['image_file_name', 'ao', None, '.', '0001', 'exr']
image_file_name_45_ao.0001.exr
['image_file_name_45', 'ao', None, '.', '0001', 'exr']
image_file_name_ao_78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']
image_file_name_ao78.0001.exr
['image_file_name', 'ao', '78', '.', '0001', 'exr']
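Since the list now always has six entries for a matching name, it can be paired with field names. This is only a sketch: the field names in FIELDS and the helper describe are assumptions made for illustration, not part of the answer above.

```python
import re

reg = re.compile(r'_([a-z]{2,8})_?(\d\d?)?([._])(\d{3,10})\.(?=[a-z]{2,6}$)')

# Field names are assumptions made for this sketch.
FIELDS = ('base', 'render_pass', 'pass_index', 'separator', 'frame', 'ext')

def describe(filename):
    parts = reg.split(filename)
    # A name that does not match comes back as a single-element list.
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

info = describe('image_file_name_ao_78.0001.exr')
print(info['render_pass'], info['pass_index'], info['frame'])
```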
Upvotes: 1