Reputation: 91

Python regex statement errors

I am new to using regex and would really appreciate any help here. I have to parse a file containing strings in the following two formats (the main difference being that the second string has an extra segment in the middle):

Abc_p123 abc_ghi_data

OR
Abc_de*_p123 abc_ghi_data

I could write a regex to match the first and second strings separately:

data_lst = re.findall('([a-zA-Z0-9]+_p\d{3})\s.*_data.*', content, re.IGNORECASE)
data_lst = re.findall('([a-zA-Z0-9]+_[a-zA-Z]+_p\d{3})\s.*_data.*', content, re.IGNORECASE)

Can someone guide me on how to combine the two findall regexes so that a single pattern works with both strings? I can still create a combined list by appending the results of the second findall call to the first list. However, I am sure there is a way to handle it in one findall statement. I tried ".*" in the middle, but that gives an error.

Please advise. Thanks,

Upvotes: 1

Answers (3)

Vincent

Reputation: 4753

You could try

([a-zA-Z0-9]+(_[a-zA-Z]+)?_p\d{3})\s.*_data.*

I replaced _[a-zA-Z]+ with (_[a-zA-Z]+)? to make it optional.

And if you don't want the extra capture group, add ?: like so: (?:_[a-zA-Z]+)?

Demo: https://regex101.com/r/5xynlx/2
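
To illustrate why the extra capture group matters with re.findall, here is a minimal sketch. The sample `content` string is an assumption (the middle segment contains letters only, since this pattern does not allow a literal asterisk), and the variable names are made up for the example. When a pattern has more than one capture group, re.findall returns a list of tuples; with a single group it returns a flat list of strings.

```python
import re

# Hypothetical sample input; the middle segment here contains letters only.
content = "Abc_p123 abc_ghi_data\nAbc_de_p123 abc_ghi_data\n"

# Two capture groups: the whole identifier and the optional middle segment.
capturing = r'([a-zA-Z0-9]+(_[a-zA-Z]+)?_p\d{3})\s.*_data.*'
# One capture group: the optional segment is wrapped in (?:...) instead.
non_capturing = r'([a-zA-Z0-9]+(?:_[a-zA-Z]+)?_p\d{3})\s.*_data.*'

with_tuples = re.findall(capturing, content, re.IGNORECASE)
flat = re.findall(non_capturing, content, re.IGNORECASE)

print(with_tuples)  # [('Abc_p123', ''), ('Abc_de_p123', '_de')]
print(flat)         # ['Abc_p123', 'Abc_de_p123']
```

The non-capturing form is usually the more convenient one here, because the result has the same shape as the two original findall calls.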

Upvotes: 1

MonkeyZeus

Reputation: 20737

You were very close:

([a-zA-Z0-9]+(?:_[a-zA-Z]+\*)?_p\d{3})\s.*_data.*

Here is the important part:

(?:_[a-zA-Z]+\*)?

It says: optionally match an underscore, followed by one or more letters (a-z, case-insensitive), followed by a literal asterisk.

https://regex101.com/r/5XCsPK/1
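
As a quick check, the pattern can be dropped into a single findall call in place of the two original ones. This is only a sketch: the helper name `extract_ids` and the sample `content` are assumptions, and the sample treats the asterisk in "Abc_de*_p123" as a literal character, as this answer does.

```python
import re

# Combined pattern from this answer; the optional group requires a literal '*'.
PATTERN = r'([a-zA-Z0-9]+(?:_[a-zA-Z]+\*)?_p\d{3})\s.*_data.*'


def extract_ids(text):
    """Return the identifiers captured by group 1, one per matching line."""
    return re.findall(PATTERN, text, re.IGNORECASE)


# Hypothetical file content built from the two sample lines in the question.
content = "Abc_p123 abc_ghi_data\nAbc_de*_p123 abc_ghi_data\n"

data_lst = extract_ids(content)
print(data_lst)  # ['Abc_p123', 'Abc_de*_p123']
```

Note that the pattern is not anchored, so re.findall may start a match in the middle of a line if the beginning of that line does not fit the pattern.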

Upvotes: 2

Ryszard Czech

Reputation: 18611

Use

([a-zA-Z0-9]+(?:_[a-zA-Z0-9*]+)?_p\d{3})\s.*_data

See proof

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [a-zA-Z0-9]+             any character of: 'a' to 'z', 'A' to
                             'Z', '0' to '9' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      _                        '_'
--------------------------------------------------------------------------------
      [a-zA-Z0-9*]+            any character of: 'a' to 'z', 'A' to
                               'Z', '0' to '9', '*' (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    _p                       '_p'
--------------------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  _data                    '_data'
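
The same breakdown can be kept next to the pattern in Python by compiling it with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. The sketch below assumes the two sample lines from the question as input; the names `PATTERN`, `content` and `verbose_matches` are only illustrative.

```python
import re

# The pattern from this answer, spread over several lines with comments.
PATTERN = re.compile(r"""
    (                      # group 1: the identifier to extract
      [a-zA-Z0-9]+         #   leading alphanumeric segment
      (?:                  #   optional middle segment (not captured)
        _[a-zA-Z0-9*]+     #     underscore, then alphanumerics or asterisks
      )?
      _p\d{3}              #   _p followed by exactly three digits
    )
    \s                     # one whitespace character
    .*_data                # anything up to the last _data on the line
""", re.VERBOSE)

# Hypothetical file content built from the two sample lines in the question.
content = "Abc_p123 abc_ghi_data\nAbc_de*_p123 abc_ghi_data\n"

verbose_matches = PATTERN.findall(content)
print(verbose_matches)  # ['Abc_p123', 'Abc_de*_p123']
```

Because the character classes already list both upper and lower case, re.IGNORECASE is not needed here unless "_p" or "_data" can appear in upper case in the real file.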

Upvotes: 0
