Reputation: 2602
What regex can I use to match ".#,#." within a string? It may or may not exist in the string. Some examples with expected outputs:
Test1.0,0.csv -> ('Test1', '0,0', 'csv') (Basic Example)
Test2.wma -> ('Test2', 'wma') (No Match)
Test3.1100,456.jpg -> ('Test3', '1100,456', 'jpg') (Basic with Large Number)
T.E.S.T.4.5,6.png -> ('T.E.S.T.4', '5,6', 'png') (Doesn't strip all periods)
Test5,7,8.sss -> ('Test5,7,8', 'sss') (No Match)
Test6.2,3,4.png -> ('Test6.2,3,4', 'png') (No Match, too many commas)
Test7.5,6.7,8.test -> ('Test7', '5,6', '7,8', 'test') (Double Match?)
The last one isn't too important, since I would only expect .#,#. to appear once. I expect most of the files I'm processing to fall into the first four examples, so I'm most interested in those.
Thanks for the help!
Upvotes: 10
Views: 355
Reputation: 43703
Use the regex pattern ^([^,]+)\.(\d+,\d+)\.([^,.]+)$
Demo:
>>> import re
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test1.0,0.csv')
[('Test1', '0,0', 'csv')]
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test2.wma')
[]
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test3.1100,456.jpg')
[('Test3', '1100,456', 'jpg')]
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'T.E.S.T.4.5,6.png')
[('T.E.S.T.4', '5,6', 'png')]
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test5,7,8.sss')
[]
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test6.2,3,4.png')
[]
>>> print re.findall(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$', 'Test7.5,6.7,8.test')
[]
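Note that the demo returns an empty list for names without a .#,#. part, whereas the question expects a (name, extension) pair in that case. A minimal sketch of one way to bridge that gap, assuming a fallback to str.rsplit is acceptable; split_name and PATTERN are hypothetical names, not part of the answer above:

```python
import re

# Same pattern as in the demo above, compiled once for reuse.
PATTERN = re.compile(r'^([^,]+)\.(\d+,\d+)\.([^,.]+)$')

def split_name(name):
    """Hypothetical helper: (stem, pair, ext) on a match, else (stem, ext)."""
    m = PATTERN.match(name)
    if m:
        return m.groups()
    # Fallback (an assumption, not part of the original answer):
    # split off only the extension at the last dot.
    return tuple(name.rsplit('.', 1))

print(split_name('Test1.0,0.csv'))  # ('Test1', '0,0', 'csv')
print(split_name('Test2.wma'))      # ('Test2', 'wma')
```

With this fallback, 'Test7.5,6.7,8.test' comes back as ('Test7.5,6.7,8', 'test'), because the anchored pattern allows only one number pair.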
Upvotes: 0
Reputation: 208725
You can use the regex \.\d+,\d+\. to find all matches for that pattern, but you will need to do a little extra to get the output you expect, especially since you want to treat .5,6.7,8. as two matches.
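To illustrate why the extra step is needed, here is a small check (a sketch, not part of the original answer): each match consumes its trailing dot, so the second pair in .5,6.7,8. has no leading dot left to match.

```python
import re

# Each match consumes both surrounding dots, so adjacent pairs
# that share a dot cannot both be matched by this pattern alone.
pairs = re.findall(r'\.\d+,\d+\.', 'Test7.5,6.7,8.test')
print(pairs)  # ['.5,6.']
```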
Here is one potential solution:
import re

def transform(s):
    # Turn the dots around each run of ".#,#" pairs into newlines, then split on them.
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    return tuple(s.split('\n'))
Examples:
>>> transform('Test1.0,0.csv')
('Test1', '0,0', 'csv')
>>> transform('Test2.wma')
('Test2.wma',)
>>> transform('Test3.1100,456.jpg')
('Test3', '1100,456', 'jpg')
>>> transform('T.E.S.T.4.5,6.png')
('T.E.S.T.4', '5,6', 'png')
>>> transform('Test5,7,8.sss')
('Test5,7,8.sss',)
>>> transform('Test6.2,3,4.png')
('Test6.2,3,4.png',)
>>> transform('Test7.5,6.7,8.test')
('Test7', '5,6', '7,8', 'test')
To also get the file extension split off when there are no matches, you can use the following:
def transform(s):
    s = re.sub(r'(\.\d+,\d+)+\.', lambda m: m.group(0).replace('.', '\n'), s)
    groups = s.split('\n')
    groups[-1:] = groups[-1].rsplit('.', 1)
    return tuple(groups)
This gives the same output as above, except that 'Test2.wma' becomes ('Test2', 'wma'), with similar behavior for 'Test5,7,8.sss' and 'Test6.2,3,4.png'.
Upvotes: 4
Reputation: 179717
To allow for multiple consecutive matches, use lookahead/lookbehind:
r'(?<=\.)\d+,\d+(?=\.)'
Example:
>>> re.findall(r'(?<=\.)\d+,\d+(?=\.)', 'Test7.5,6.7,8.test')
['5,6', '7,8']
We can also use lookahead to perform the split as you want it:
import re

def split_it(s):
    pieces = re.split(r'\.(?=\d+,\d+\.)', s)
    pieces[-1:] = pieces[-1].rsplit('.', 1)  # split off extension
    return pieces
Testing:
>>> print split_it('Test1.0,0.csv')
['Test1', '0,0', 'csv']
>>> print split_it('Test2.wma')
['Test2', 'wma']
>>> print split_it('Test3.1100,456.jpg')
['Test3', '1100,456', 'jpg']
>>> print split_it('T.E.S.T.4.5,6.png')
['T.E.S.T.4', '5,6', 'png']
>>> print split_it('Test5,7,8.sss')
['Test5,7,8', 'sss']
>>> print split_it('Test6.2,3,4.png')
['Test6.2,3,4', 'png']
>>> print split_it('Test7.5,6.7,8.test')
['Test7', '5,6', '7,8', 'test']
Upvotes: 3
Reputation: 22064
This is pretty close. Does Python support named groups?
^.*(?P<group1>\d+(?:,\d+)?)\.(?P<group2>\d+(?:,\d+)?).*\..+$
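For reference, Python's re module does support named groups through the (?P<name>...) syntax. A minimal sketch with a simplified pattern; the pattern and the group names stem, pair, and ext are illustrative assumptions, not the regex above:

```python
import re

# Hypothetical pattern: lazy stem, one "#,#" pair, then an extension without dots.
NAMED = re.compile(r'^(?P<stem>.+?)\.(?P<pair>\d+,\d+)\.(?P<ext>[^.]+)$')

m = NAMED.match('T.E.S.T.4.5,6.png')
if m:
    print(m.group('stem'))  # T.E.S.T.4
    print(m.group('pair'))  # 5,6
    print(m.group('ext'))   # png
```

Names without a .#,#. part, such as 'Test2.wma', do not match this pattern at all, so the caller would still need to handle a None result.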
Upvotes: 0
Reputation: 12601
^(.*?)\.(\d+,\d+)\.(.*?)$
This passes your tests, at least in Patterns.
Upvotes: 0
Reputation: 44289
'/^(.+)\.((\d+,\d+)\.)?(.+)$/'
The third capturing group should contain the pair of numbers. If there are several such pairs, you should get multiple matches, and the third capturing group would always contain the pair.
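A quick check in Python, with the /.../ delimiters dropped because Python's re module does not use them, suggests a caveat: the first (.+) is greedy and the pair is optional, so the first group can absorb the numbers and leave the third group empty. A sketch of that check:

```python
import re

# Pattern from the answer above, without the surrounding /.../ delimiters.
m = re.match(r'^(.+)\.((\d+,\d+)\.)?(.+)$', 'Test1.0,0.csv')
groups = m.groups()
print(groups)  # ('Test1.0,0', None, None, 'csv')
```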
Upvotes: 0