jcam77

Reputation: 173

Searching for multiple substrings of unknown size in string in python

I've seen lots of regex material for Python, but nothing for this exact case, and I can't seem to get it working. I have a list of files with names that look like this:

summary_Cells_a_01_2_1_45000_it_1.txt
summary_Cells_a_01_2_1_40000_it_2.txt
summary_Cells_bb_01_2_1_36000_it_3.txt

The "summary_Cells_" is always present. Then there is a string of letters, either 1, 2 or 3 long. Then there is "_01_2_1_" always. Then there is a number between 400 and 45000. Then there is "it" and then a number from 0-9, then ".txt"

I need to extract the letter(s) piece.

I was trying:

match = re.search('summary_Cells_(\w)_01_2_1_(\w)_it_(\w).txt', filename)

but was not getting anything for the match. I'm trying to get just the letters, but later might want the it number (last number) or the step (the middle number).

Any ideas?

Thanks

Upvotes: 0

Views: 1391

Answers (5)

Spice

Reputation: 352

You're on the right track with your regex, but as everyone else forgets, \w includes alphanumerics and the underscore, so you should use [a-z] instead.

re.search(r"summary_Cells_([a-z]+)_\w+\.txt", filename)

Or, as Padraic mentioned, you can just use str.split("_").
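
To see the difference in practice, here is a minimal sketch run against the three file names from the question. The variable names (`filenames`, `pattern`, `letters`) are only illustrative, and it assumes the letter part is always lowercase:

```python
import re

# The three example file names from the question.
filenames = [
    "summary_Cells_a_01_2_1_45000_it_1.txt",
    "summary_Cells_a_01_2_1_40000_it_2.txt",
    "summary_Cells_bb_01_2_1_36000_it_3.txt",
]

# [a-z]+ stops at the first underscore, whereas \w would also match "_".
pattern = re.compile(r"summary_Cells_([a-z]+)_\w+\.txt")

letters = []
for name in filenames:
    m = pattern.search(name)
    # m is None when the name does not fit the pattern.
    letters.append(m.group(1) if m else None)

# \w really does match an underscore:
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None

print(letters)                  # ['a', 'a', 'bb']
print(underscore_is_word_char)  # True
```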

Upvotes: 0

outlyer

Reputation: 3943

You're missing repetitions, i.e.:

re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)

\w will only match a single character
\w+ will match at least one
\w* will match any amount (0 or more)
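
A quick way to check this is to run both versions against one of the file names from the question. This is just a sketch; `single`, `repeated` and `empty` are made-up names, and the `x(\w*)y` pattern is only there to show the zero-length case:

```python
import re

filename = "summary_Cells_bb_01_2_1_36000_it_3.txt"

# One \w per group: "bb" and "36000" are longer than one character, so no match.
single = re.search(r"summary_Cells_(\w)_01_2_1_(\w)_it_(\w)\.txt", filename)

# \w+ per group: each group takes one or more characters.
repeated = re.search(r"summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt", filename)

# \w* may also match nothing at all.
empty = re.search(r"x(\w*)y", "xy")

print(single)             # None
print(repeated.groups())  # ('bb', '36000', '3')
print(repr(empty.group(1)))  # ''
```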

Upvotes: 3

Jivan

Reputation: 23038

Since you only want to capture the letters at the beginning, you could do:

re.search(r'summary_Cells_(\w+)_01_2_1_[0-9]{3,6}_it_[0-9]\.txt', filename)

This skips the capture groups you don't need.

[0-9] matches a single digit, and [0-9]{3,6} matches a run of 3 to 6 digits.
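
If you later want the step and the "it" number as well, one option is a stricter pattern with named groups. This is only a sketch: `PATTERN` and `parse_name` are hypothetical names, and the quantifiers assume what the question describes (1 to 3 lowercase letters, a step between 400 and 45000, so 3 to 5 digits, and a single-digit iteration number):

```python
import re

# Assumed layout, taken from the question's description of the file names.
PATTERN = re.compile(
    r"summary_Cells_(?P<letters>[a-z]{1,3})"
    r"_01_2_1_(?P<step>[0-9]{3,5})"
    r"_it_(?P<it>[0-9])\.txt"
)

def parse_name(filename):
    """Return (letters, step, it), or None if the name does not fit."""
    m = PATTERN.fullmatch(filename)
    if m is None:
        return None
    return m.group("letters"), int(m.group("step")), int(m.group("it"))

print(parse_name("summary_Cells_a_01_2_1_45000_it_1.txt"))   # ('a', 45000, 1)
print(parse_name("summary_Cells_bb_01_2_1_36000_it_3.txt"))  # ('bb', 36000, 3)
print(parse_name("summary_Cells_a_01_2_1_45000_it_1Xtxt"))   # None
```

Using fullmatch together with the escaped `\.` means a name with anything other than a literal ".txt" at the end is rejected rather than silently accepted.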

Upvotes: 0

Padraic Cunningham

Reputation: 180391

You don't need a regex, there is nothing complex about the pattern and it does not change:

s = "summary_Cells_a_01_2_1_45000_it_1.txt"
print(s.split("_")[2])
a
s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
print(s.split("_")[2])
bb

If you want both sets of letters:

s = "summary_Cells_bb_01_2_1_36000_it_3.txt"
spl = s.split("_")
a,b = spl[2],spl[7]
print(a,b)
('bb', 'it')
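
The same idea extends to the step and the iteration number mentioned in the question. A rough sketch, assuming every name has exactly the layout shown there (`split_name` is a made-up helper, and the indexes 2, 6 and 8 depend on that layout):

```python
def split_name(filename):
    """Return (letters, step, it) using only str methods."""
    # Drop the ".txt" extension first so the last field is a clean number.
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("_")
    # parts == ['summary', 'Cells', letters, '01', '2', '1', step, 'it', it]
    return parts[2], int(parts[6]), int(parts[8])

print(split_name("summary_Cells_bb_01_2_1_36000_it_3.txt"))  # ('bb', 36000, 3)
print(split_name("summary_Cells_a_01_2_1_45000_it_1.txt"))   # ('a', 45000, 1)
```

Unlike a regex, this does not check that the name really has the expected shape, so a differently structured file name would raise an IndexError or ValueError, or return the wrong fields.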

Upvotes: 1

nu11p01n73R

Reputation: 26667

You were almost there. All you need to do is add a repetition (+) inside each capture group:

summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt

Example usage

>>> import re
>>> filename = "summary_Cells_a_01_2_1_45000_it_1.txt"
>>> match = re.search(r'summary_Cells_(\w+)_01_2_1_(\w+)_it_(\w+)\.txt', filename)
>>> match.group()
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(0)
'summary_Cells_a_01_2_1_45000_it_1.txt'
>>> match.group(1)
'a'
>>> match.group(2)
'45000'
>>> match.group(3)
'1'

Note

match.group(n) returns the value captured by the nth capture group; match.group(0), or match.group() with no argument, returns the whole match.

Upvotes: 1
