Reputation: 63
I have the filename below and I want to extract the year and the _TEXT part.
fle_2019-11-17A17-21-09.01(_TEXT).txt
I am able to do this using two regexes and then joining the results.
(?<=\_)(\d{4})(?=\-)
This gives me year
(?<=\()(.*)(?=\))
This gives me _TEXT
Is there a way to get this from a single expression?
Upvotes: 2
Views: 150
Reputation: 163352
One option is to use 2 capturing groups. Depending on what you would allow to match before the first underscore, you could for example use a character class to match word characters without an underscore [^\W_]+
^[^\W_]+_(\d{4})-[\w.-]+\(([^)]+)\)\.\w+$
In parts:
^            Start of string
[^\W_]+      Match 1+ word chars except _
_            Match the _
(\d{4})      Capture group 1, match 4 digits
-[\w.-]+     Match - and 1+ word chars, . or - (extend the character class with what you would allow to match)
\(           Match (
([^)]+)      Capture group 2, match 1+ times any char except )
\)           Match )
\.\w+        Match a . and 1+ word chars
$            End of string
For example
import re
regex = r"^[^\W_]+_(\d{4})-[\w.-]+\(([^)]+)\)\.\w+$"
test_str = "fle_2019-11-17A17-21-09.01(_TEXT).txt"
print(re.findall(regex, test_str))
Output
[('2019', '_TEXT')]
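If only one filename is checked at a time, the same pattern can also be used with re.match and named groups, so the two values are read by name and a non-matching filename is easy to detect. This is a minimal sketch, not part of the original answer: the group names year and text, the helper extract_parts, and the second test filename are all illustrative assumptions.

```python
import re

# Same pattern as above, rewritten with named groups.
# The names "year" and "text" are illustrative, not from the original answer.
PATTERN = re.compile(
    r"^[^\W_]+_(?P<year>\d{4})-[\w.-]+\((?P<text>[^)]+)\)\.\w+$"
)


def extract_parts(filename):
    """Return (year, text) if the filename matches the pattern, else None."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return match.group("year"), match.group("text")


print(extract_parts("fle_2019-11-17A17-21-09.01(_TEXT).txt"))  # ('2019', '_TEXT')
print(extract_parts("no_match.txt"))  # None
```

Because the pattern is anchored with ^ and $, a filename that deviates from the expected layout yields None rather than a partial result.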
Upvotes: 1
Reputation: 521259
In the interest of simplicity, we could just try using re.findall with an alternation which matches either a 4-digit year or the text inside the parentheses:
import re

file = "fle_2019-11-17A17-21-09.01(_TEXT).txt"
# First alternative: 4 digits followed by "-NN"; second: text between ( and )
parts = re.findall(r'\d{4}(?=-\d{2})|(?<=\().*?(?=\))', file)
print(parts)
This prints:
['2019', '_TEXT']
I like this approach because the output already yields separate logical values for the year and the parenthesized text.
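One thing to keep in mind with the alternation is that re.findall returns every match, so a filename containing a second date-like run would produce more than two items. The sketch below wraps the call in a small check. The helper name year_and_text and the second filename are hypothetical, made up only to illustrate that case.

```python
import re

# The alternation from the answer above, unchanged.
ALTERNATION = r'\d{4}(?=-\d{2})|(?<=\().*?(?=\))'


def year_and_text(filename):
    """Return (year, text), or raise ValueError unless exactly two parts are found."""
    parts = re.findall(ALTERNATION, filename)
    if len(parts) != 2:
        raise ValueError(f"expected 2 parts, got {parts!r}")
    year, text = parts
    return year, text


print(year_and_text("fle_2019-11-17A17-21-09.01(_TEXT).txt"))  # ('2019', '_TEXT')

# Hypothetical filename with two dates: findall returns three items here.
try:
    year_and_text("fle_2019-11-17_2020-01-01(_TEXT).txt")
except ValueError as exc:
    print(exc)
```

If filenames always contain exactly one date and one parenthesized part, the check never fires and the plain findall call is enough.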
Upvotes: 1