Reputation: 173
I've seen lots of regex material for Python, but nothing that covers my exact case, and I can't seem to get it working. I have a list of files with names that look like this:
summary_Cells_a_01_2_1_45000_it_1.txt
summary_Cells_a_01_2_1_40000_it_2.txt
summary_Cells_bb_01_2_1_36000_it_3.txt
The "summary_Cells_" prefix is always present. Then comes a string of 1, 2 or 3 letters, followed by "_01_2_1_" (always). Then there is a number between 400 and 45000, then "_it_", then a single digit from 0 to 9, then ".txt".
I need to extract the letter(s) piece.
I was trying:
match = re.search('summary_Cells_(\w)_01_2_1_(\w)_it_(\w).txt', filename)
but was not getting anything for the match. I'm trying to get just the letters, but later might want the it number (last number) or the step (the middle number).
Any ideas?
Thanks
Upvotes: 0
Views: 1391
Reputation: 352
You're on the right track with your regex, but, as everyone else forgets, \w includes alphanumerics and the underscore, so you should use [a-z] instead.
re.search(r"summary_Cells_([a-z]+)_\w+\.txt", filename)
Or, as Padraic mentioned, you can just use str.split("_").
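To make the difference concrete, here is a small sketch run against the three filenames from the question. It assumes the letter part is always lowercase, as in those examples; the variable names are only for illustration.

```python
import re

filenames = [
    "summary_Cells_a_01_2_1_45000_it_1.txt",
    "summary_Cells_a_01_2_1_40000_it_2.txt",
    "summary_Cells_bb_01_2_1_36000_it_3.txt",
]

# [a-z]+ stops at the first underscore, so only the letters are captured.
# Assumption: the letter code is lowercase, as in the question's examples.
pattern = re.compile(r"summary_Cells_([a-z]+)_\w+\.txt")
letters = [pattern.search(name).group(1) for name in filenames]
print(letters)

# With \w+ in the group and nothing fixed after it, the greedy match
# swallows digits and underscores too.
greedy = re.search(r"summary_Cells_(\w+)_\w+\.txt", filenames[0]).group(1)
print(greedy)
```

The second pattern shows why \w+ is only safe here when a fixed piece such as "_01_2_1_" follows the group.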
Upvotes: 0
Reputation: 3943
You're missing repetitions, i.e.:
re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)
\w will only match a single character
\w+ will match at least one
\w* will match any amount (0 or more)
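A quick way to see the three quantifiers side by side is sketched below. The filename with an empty letter part is made up purely to show what \w* allows; it is not one of the real files.

```python
import re

filename = "summary_Cells_bb_01_2_1_36000_it_3.txt"

# \w matches exactly one character, so a two-letter code fails to match.
single = re.search(r"summary_Cells_(\w)_01_2_1_", filename)

# \w+ matches one or more characters.
one_or_more = re.search(r"summary_Cells_(\w+)_01_2_1_", filename)

# \w* also accepts zero characters (hypothetical name with no letters).
zero_or_more = re.search(r"summary_Cells_(\w*)_01_2_1_",
                         "summary_Cells__01_2_1_400_it_0.txt")

print(single)
print(one_or_more.group(1))
print(repr(zero_or_more.group(1)))
```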
Upvotes: 3
Reputation: 23038
Since you only want to capture the letters at the beginning, you could do:
re.search(r'summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9]\.txt', filename)
This doesn't bother giving you the groups you don't need.
[0-9] matches a single digit, and [0-9]{3,6} allows for 3 to 6 digits.
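As a rough sketch of how the stricter pattern behaves, the helper below (letters_of is a name made up for this example) uses re.fullmatch and an escaped dot so that malformed names are rejected instead of silently matched. The two bad filenames are invented test inputs.

```python
import re

pattern = re.compile(r"summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9]\.txt")

def letters_of(filename):
    """Return the letter code, or None if the name does not fit the pattern."""
    match = pattern.fullmatch(filename)
    return match.group(1) if match else None

good = letters_of("summary_Cells_bb_01_2_1_36000_it_3.txt")
too_few_digits = letters_of("summary_Cells_bb_01_2_1_36_it_3.txt")
no_dot = letters_of("summary_Cells_bb_01_2_1_36000_it_3Xtxt")

print(good, too_few_digits, no_dot)
```

Escaping the dot matters because an unescaped . matches any character, not just a literal period.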
Upvotes: 0
Reputation: 180391
You don't need a regex; there is nothing complex about the pattern, and it does not change:
s = "summary_Cells_a_01_2_1_45000_it_1.txt"
print(s.split("_")[2])
a
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
print(s.split("_")[2])
bb
If you want both sets of letters:
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
spl = s.split("_")
a,b = spl[2],spl[7]
print(a,b)
('bb', 'it')
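If the step and the iteration number are needed later, the same split idea can be extended. This is only a sketch, assuming every name has exactly the layout described in the question; index 7 is the literal "it", so the iteration digit sits at index 8 once the extension is removed.

```python
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"

# Drop the ".txt" extension first so the last field is just the digit.
stem = s.rsplit(".", 1)[0]
parts = stem.split("_")
# parts: ['summary', 'Cells', 'bb', '01', '2', '1', '36000', 'it', '3']

letters = parts[2]
step = int(parts[6])
iteration = int(parts[8])

print(letters, step, iteration)
```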
Upvotes: 1
Reputation: 26667
You were almost there. All you need to do is add a repetition (+) to the \w inside each capture group:
summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt
Example usage
>>> filename="summary_Cells_a_01_2_1_45000_it_1.txt"
>>> match = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)
>>> match.group()
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(0)
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(1)
'a'
>>> match.group(2)
'45000'
>>> match.group(3)
'1'
Note
match.group(n) returns the value captured by the nth capture group.
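If counting group numbers gets awkward, named groups are one alternative. The sketch below is not from the answer above; the group names letters, step and iteration are chosen just for this example, and it assumes lowercase letters and a single-digit iteration as described in the question.

```python
import re

pattern = re.compile(
    r"summary_Cells_(?P<letters>[a-z]+)_01_2_1_(?P<step>\d+)_it_(?P<iteration>\d)\.txt"
)

match = pattern.fullmatch("summary_Cells_a_01_2_1_45000_it_1.txt")

# groupdict() maps each group name to the text it captured.
info = match.groupdict()
print(info)
print(match.group("letters"))
```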
Upvotes: 1