srt243

Reputation: 63

Regex to extract date and specific string

I have a filename like the one below, and I want to extract the year and the _TEXT part.

fle_2019-11-17A17-21-09.01(_TEXT).txt

I am able to do this using two regexes and then joining the results.

(?<=\_)(\d{4})(?=\-) This gives me the year

(?<=\()(.*)(?=\)) This gives me _TEXT
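For reference, the two-step approach could be sketched in Python roughly as follows. This assumes `re.search` is used for each pattern; the variable names are only illustrative.

```python
import re

filename = "fle_2019-11-17A17-21-09.01(_TEXT).txt"

# First pattern: four digits preceded by "_" and followed by "-"
year_match = re.search(r"(?<=_)(\d{4})(?=-)", filename)
# Second pattern: everything between "(" and ")"
text_match = re.search(r"(?<=\()(.*)(?=\))", filename)

# Guard against a missing match before reading the groups
year = year_match.group(1) if year_match else None
text = text_match.group(1) if text_match else None

print(year, text)
```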

Is there a way to get this from a single expression?

Upvotes: 2

Views: 150

Answers (2)

The fourth bird

Reputation: 163352

One option is to use 2 capturing groups. Depending on what you would allow to match before the first underscore, you could for example use the negated character class [^\W_]+, which matches word characters except the underscore:

^[^\W_]+_(\d{4})-[\w.-]+\(([^)]+)\)\.\w+$

In parts

  • ^ Start of string
  • [^\W_]+ Match 1+ word chars except _
  • _ Match the _
  • (\d{4}) Capture group 1, match exactly 4 digits
  • -[\w.-]+ Match - and 1+ word chars, . or - (extend the character class with what you would allow to match)
  • \( Match (
  • ([^)]+) Capture group 2, match 1+ times any char except )
  • \) Match )
  • \.\w+ Match a . and 1+ word chars
  • $ End of string
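The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch of an alternative way to write the same expression.

```python
import re

pattern = re.compile(r"""
    ^            # start of string
    [^\W_]+      # 1+ word characters except _
    _            # literal underscore
    (\d{4})      # group 1: exactly 4 digits (the year)
    -[\w.-]+     # a hyphen, then 1+ word characters, dots or hyphens
    \(           # literal opening parenthesis
    ([^)]+)      # group 2: 1+ characters other than a closing parenthesis
    \)           # literal closing parenthesis
    \.\w+        # a dot and 1+ word characters (the extension)
    $            # end of string
""", re.VERBOSE)

match = pattern.match("fle_2019-11-17A17-21-09.01(_TEXT).txt")
groups = match.groups() if match else None
print(groups)
```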


For example

import re

regex = r"^[^\W_]+_(\d{4})-[\w.-]+\(([^)]+)\)\.\w+$"
test_str = "fle_2019-11-17A17-21-09.01(_TEXT).txt"
print(re.findall(regex, test_str))

Output

[('2019', '_TEXT')]
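If you prefer to read the values by name and handle filenames that do not fit the pattern, a variant with named groups and `re.match` might look like this. The group names `year` and `text` and the helper `parse_filename` are hypothetical, chosen just for this sketch.

```python
import re

# Same pattern as above, with the two groups given (hypothetical) names
PATTERN = re.compile(
    r"^[^\W_]+_(?P<year>\d{4})-[\w.-]+\((?P<text>[^)]+)\)\.\w+$"
)

def parse_filename(name):
    """Return (year, text) for a matching filename, or None otherwise."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return m.group("year"), m.group("text")

print(parse_filename("fle_2019-11-17A17-21-09.01(_TEXT).txt"))
print(parse_filename("no_match_here.txt"))
```

Because the pattern is anchored with ^ and $, a filename that deviates from the expected layout gives None instead of a partial result.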

Upvotes: 1

Tim Biegeleisen

Reputation: 521259

In the interest of simplicity, we could just try using re.findall with an alternation which matches either a 4 digit year (followed by a hyphen and two digits) or the text between the parentheses:

import re

file = "fle_2019-11-17A17-21-09.01(_TEXT).txt"
parts = re.findall(r'\d{4}(?=-\d{2})|(?<=\().*?(?=\))', file)
print(parts)

This prints:

['2019', '_TEXT']

I like this approach because the output already yields separate logical values for the year and the text in parentheses.
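One thing to keep in mind is that re.findall returns every match in order of appearance, so a name containing a second year-like part yields more than two items. A minimal sketch that unpacks the result only when exactly two parts are found (the helper name `year_and_text` and the second filename are made up for illustration):

```python
import re

PATTERN = r'\d{4}(?=-\d{2})|(?<=\().*?(?=\))'

def year_and_text(name):
    """Return (year, text) if exactly two parts are found, else None."""
    parts = re.findall(PATTERN, name)
    if len(parts) != 2:
        return None
    return tuple(parts)

print(year_and_text("fle_2019-11-17A17-21-09.01(_TEXT).txt"))
# Hypothetical name with two date-like parts: findall gives three matches
print(year_and_text("fle_2019-11-17_2020-01-05(_TEXT).txt"))
```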

Upvotes: 1
