Reputation: 592
I am trying to parse some text reports into structured data. Typical lines are:
Cat. No.: 1 Location: Bottles, boxes etc
Cat. No.: 25 Location: Woods size B EBN: 63.1868
Cat. No.: 24 Location: Woods size B EBN: 12.1980.221
Cat. No.: 20 Location: Woods size B EBN: 4.1973
Cat. No.: 19 Location: Woods size B
The first two values are always present, the last is optional.
/Cat\. No\.: (\d+) Location: (.+)(?: EBN: ([\d\.]+))/
works for lines with all three values but my instinct is that I need to add a ? to the end to make the last part optional i.e.
/Cat\. No\.: (\d+) Location: (.+)(?: EBN: ([\d\.]+))?/
I am then finding that capture group 2 matches everything after 'Location: ', so for line 2, for example, it becomes 'Woods size B EBN: 63.1868'.
Have saved this at https://regex101.com/r/gd0pKH/1 and would be grateful for any advice. RegEx to match part of string that may or may not be present appears to be the same question and has the same answer I came up with, but for some reason it doesn't seem to be working for me!
Upvotes: 0
Views: 64
Reputation: 15629
You could fix your regex with the following steps:
The second capturing group, (.+), should be ungreedy (lazy), or it will match everything up to the end of the line: (.+?)
You should add an end-of-line anchor, $, otherwise the regex stops at the first successful match, which is obviously the shorter version, and in that case your third capturing group would be empty.
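To see the effect of each step, here is a minimal Python sketch that runs three variants against one sample line from the question. The variable names are only illustrative, and the sample line is typed with single spaces between fields as it appears in the question text; the real report may use wider gaps.

```python
import re

line = "Cat. No.: 25 Location: Woods size B EBN: 63.1868"

# Greedy (.+) with an optional EBN group: group 2 swallows the rest of the line.
greedy = re.compile(r"Cat\. No\.: (\d+) Location: (.+)(?: EBN: ([\d.]+))?")

# Lazy (.+?) without an anchor: the match stops as early as possible.
lazy_unanchored = re.compile(r"Cat\. No\.: (\d+) Location: (.+?)(?: EBN: ([\d.]+))?")

# Lazy (.+?) with $: the match must reach the end of the line,
# so the optional EBN group is used whenever it can be.
lazy_anchored = re.compile(r"Cat\. No\.: (\d+) Location: (.+?)(?: EBN: ([\d.]+))?$")

for name, pattern in [
    ("greedy", greedy),
    ("lazy, no anchor", lazy_unanchored),
    ("lazy, anchored", lazy_anchored),
]:
    print(name, pattern.match(line).groups())
```

Only the last variant should yield 'Woods size B' in group 2 and '63.1868' in group 3.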
Altogether, you get this:
Cat\. No\.: (\d+) Location: (.+?)(?: EBN: ([\d\.]+))?$
In addition, you could think about using \s+ instead of the six spaces, which makes the expression more flexible:
Cat\. No\.: (\d+)\s+Location: (.+?)(?:\s+EBN: ([\d\.]+))?$
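As a rough illustration of how the \s+ version could be used to build structured data, here is a Python sketch. The function name parse_line, the dictionary keys, and the exact spacing in the sample lines are assumptions made for the example, not something taken from the original report.

```python
import re

# \s+ accepts any run of whitespace between the fields.
PATTERN = re.compile(r"Cat\. No\.: (\d+)\s+Location: (.+?)(?:\s+EBN: ([\d.]+))?$")


def parse_line(line):
    """Return a dict for one report line, or None if it does not match."""
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    cat_no, location, ebn = m.groups()
    # ebn is None when the optional EBN part is absent.
    return {"cat_no": int(cat_no), "location": location, "ebn": ebn}


# Sample lines with made-up spacing to exercise \s+.
report_lines = [
    "Cat. No.: 1      Location: Bottles, boxes etc",
    "Cat. No.: 24      Location: Woods size B      EBN: 12.1980.221",
    "Cat. No.: 19 Location: Woods size B",
]

records = [parse_line(l) for l in report_lines]
for record in records:
    print(record)
```

Stripping each line first keeps trailing whitespace from ending up in the Location value.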
Upvotes: 2
Reputation: 370779
You can have the Location value repeat lazily, and then use a positive lookahead for either two spaces in a row (for a line with EBN) or the end of the line (for a line without EBN):
Cat\. No\.: (\d+) Location: (.+?)(?= |$)(?: EBN: ([\d\.]+))?
https://regex101.com/r/gd0pKH/2
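The lookahead only helps if the gap before EBN can be told apart from the single spaces inside a Location value. The Python sketch below therefore assumes that fields in the raw report are separated by two or more spaces. It is an adaptation rather than the exact expression above: the field separators are written as " +" and the lookahead as " {2}", and the sample spacing is invented for the example.

```python
import re

# (.+?) grows lazily until the lookahead sees two spaces or the end of the line.
LOOKAHEAD = re.compile(
    r"Cat\. No\.: (\d+) +Location: (.+?)(?= {2}|$)(?: +EBN: ([\d.]+))?"
)

with_ebn = "Cat. No.: 25      Location: Woods size B      EBN: 63.1868"
without_ebn = "Cat. No.: 1      Location: Bottles, boxes etc"

print(LOOKAHEAD.match(with_ebn).groups())
print(LOOKAHEAD.match(without_ebn).groups())
```

One trade-off of this approach is that a Location value containing two consecutive spaces would be cut short at that point, whereas the anchored version does not depend on the width of the gap.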
Upvotes: 1