Reputation: 91
I am new to regex and would really appreciate any help here. I have to parse a file containing strings in the following two formats (the main difference being that the second string has an extra "_de*" segment in the middle):
Abc_p123 abc_ghi_data
OR
Abc_de*_p123 abc_ghi_data
I could write a regex to match the first and second strings separately:
data_lst = re.findall(r'([a-zA-Z0-9]+_p\d{3})\s.*_data.*', content, re.IGNORECASE)
data_lst = re.findall(r'([a-zA-Z0-9]+_[a-zA-Z]+_p\d{3})\s.*_data.*', content, re.IGNORECASE)
Can someone advise how to combine the two findall regexes so that a single pattern works with both strings? I could still build one combined list by appending the results of the second findall call to the first list. However, I am sure there is a way to handle it in one findall statement. I tried ".*" in the middle, but that gives an error.
Please advise. Thanks,
Upvotes: 1
Views: 63
Reputation: 4753
You could try
([a-zA-Z0-9]+(_[a-zA-Z]+)?_p\d{3})\s.*_data.*
I replaced _[a-zA-Z]+ with (_[a-zA-Z]+)? to make it optional.
And if you don't want the extra capture group, add ?: like so: (?:_[a-zA-Z]+)?
Demo: https://regex101.com/r/5xynlx/2
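To illustrate why the extra capture group matters with re.findall, here is a minimal sketch. The sample content is an assumption: it uses "Abc_de_p123" without the asterisk, because the character classes in this pattern do not include "*". With two capture groups findall returns tuples, while the (?:...) form returns plain strings.

```python
import re

# Hypothetical sample input; the second line has no '*' because
# [a-zA-Z]+ in this pattern would not match one.
content = "Abc_p123 abc_ghi_data\nAbc_de_p123 abc_ghi_data\n"

# Two capture groups: findall returns one tuple per match,
# with '' for the optional group when it did not participate.
capturing = re.findall(
    r'([a-zA-Z0-9]+(_[a-zA-Z]+)?_p\d{3})\s.*_data.*', content)

# Non-capturing optional group: findall returns only group 1 as strings.
non_capturing = re.findall(
    r'([a-zA-Z0-9]+(?:_[a-zA-Z]+)?_p\d{3})\s.*_data.*', content)

print(capturing)      # [('Abc_p123', ''), ('Abc_de_p123', '_de')]
print(non_capturing)  # ['Abc_p123', 'Abc_de_p123']
```

So if you only want a flat list of identifiers, the non-capturing version is probably the more convenient one.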
Upvotes: 1
Reputation: 20737
You were very close:
([a-zA-Z0-9]+(?:_[a-zA-Z]+\*)?_p\d{3})\s.*_data.*
Here is the important part:
(?:_[a-zA-Z]+\*)?
It says: optionally match an underscore, followed by one or more letters (a-z or A-Z), followed by an asterisk.
https://regex101.com/r/5XCsPK/1
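As a quick check, this pattern can be run against the two sample lines from the question. This is only a sketch: the content string is made up from those two lines, and it assumes the asterisk is a literal character in the data.

```python
import re

# The two sample lines from the question, joined into one string.
content = "Abc_p123 abc_ghi_data\nAbc_de*_p123 abc_ghi_data\n"

# Optional non-capturing segment: underscore, letters, literal asterisk.
pattern = r'([a-zA-Z0-9]+(?:_[a-zA-Z]+\*)?_p\d{3})\s.*_data.*'

data_lst = re.findall(pattern, content)
print(data_lst)  # ['Abc_p123', 'Abc_de*_p123']
```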
Upvotes: 2
Reputation: 18611
Use
([a-zA-Z0-9]+(?:_[a-zA-Z0-9*]+)?_p\d{3})\s.*_data
See proof
Explanation
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [a-zA-Z0-9]+           any character of: 'a' to 'z', 'A' to
                           'Z', '0' to '9' (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                    group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
      _                    '_'
--------------------------------------------------------------------------------
      [a-zA-Z0-9*]+        any character of: 'a' to 'z', 'A' to
                           'Z', '0' to '9', '*' (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
    )?                     end of grouping
--------------------------------------------------------------------------------
    _p                     '_p'
--------------------------------------------------------------------------------
    \d{3}                  digits (0-9) (3 times)
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  _data                    '_data'
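The breakdown above can also be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace and allows comments inside the regex. A minimal sketch follows. The name PATTERN and the sample content are assumptions for illustration, and the third sample line (without the asterisk) is added only to show that this character class accepts both variants.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part is commented.
PATTERN = re.compile(r"""
    (                     # group 1: the identifier to extract
      [a-zA-Z0-9]+        # leading alphanumeric part, e.g. 'Abc'
      (?:                 # optional middle segment, not captured
        _[a-zA-Z0-9*]+    # underscore, then alphanumerics or asterisks
      )?
      _p\d{3}             # '_p' followed by three digits
    )
    \s                    # one whitespace character
    .*_data               # rest of the line up to the last '_data'
""", re.VERBOSE)

# Hypothetical sample input based on the question's two formats.
content = (
    "Abc_p123 abc_ghi_data\n"
    "Abc_de*_p123 abc_ghi_data\n"
    "Abc_de_p123 abc_ghi_data\n"
)

data_lst = PATTERN.findall(content)
print(data_lst)  # ['Abc_p123', 'Abc_de*_p123', 'Abc_de_p123']
```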
Upvotes: 0