jamieb

Reputation: 10033

Regex matching an optional part of a document?

I have a plain text document that contains various freeform records that look like one of these two:

Title: Red car
Date: 2021-02-10
    
Description: This car is very red.
It goes very fast.

There are many like it but this one is mine.

Second:

Title: Blue truck
Date: 2021-02-11
    
Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.

I'm trying to use a regex in Python 3 with named groups to capture the fields. The "Notes" field is optional. The closest I've gotten is:

(?:Description:)(?P<description>.+?)\n\n(?:Notes:)?(?P<notes>.+)?

But it's still capturing text into "notes" even when the word "Notes:" doesn't appear in the document. Any suggestions?

Upvotes: 0

Views: 72

Answers (2)

The fourth bird

Reputation: 163217

You could first match the description, followed by all lines that do not start with Notes: or Title:. Then optionally match the line with the notes.

Description:(?P<description>.+(?:\r?\n(?!(?:Notes|Title):).*)*)(?:\s*\r?\n(?P<notes>Notes:.*))?

The pattern matches:

  • Description: Match literally
  • (?P<description> Named group description
    • .+(?:\r?\n(?!(?:Notes|Title):).*)* Match the rest of the line, then all following lines that do not start with Notes: or Title:
  • ) Close group
  • (?: Non capture group
    • \s*\r?\n Match optional whitespace chars and a newline
    • (?P<notes>Notes:.*) Named group notes, match the whole line
  • )? Close group and make it optional
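
As a minimal sketch of how this pattern might be used from Python: the pattern is the one given above, the sample records are copied from the question, and the `extract` helper and the constant names are hypothetical, added only for illustration. Note that the `notes` group keeps the literal "Notes:" label, and the description is stripped because the capture can include leading and trailing whitespace.

```python
import re

# Pattern from the answer above, split across two literals for readability.
PATTERN = re.compile(
    r"Description:(?P<description>.+(?:\r?\n(?!(?:Notes|Title):).*)*)"
    r"(?:\s*\r?\n(?P<notes>Notes:.*))?"
)

# Sample records copied from the question.
RED_CAR = (
    "Title: Red car\n"
    "Date: 2021-02-10\n"
    "\n"
    "Description: This car is very red.\n"
    "It goes very fast.\n"
    "\n"
    "There are many like it but this one is mine.\n"
)
BLUE_TRUCK = (
    "Title: Blue truck\n"
    "Date: 2021-02-11\n"
    "\n"
    "Description: The truck is blue.\n"
    "It carries a lot of stuff.\n"
    "\n"
    "Notes: This one looks damaged.\n"
)


def extract(record):
    """Hypothetical helper: return the description and notes of one record."""
    match = PATTERN.search(record)
    if match is None:
        return None
    return {
        # The capture starts right after "Description:" and may end with
        # blank lines, so surrounding whitespace is stripped here.
        "description": match.group("description").strip(),
        # None when the optional group did not take part in the match.
        "notes": match.group("notes"),
    }


print(extract(RED_CAR)["notes"])     # None
print(extract(BLUE_TRUCK)["notes"])  # Notes: This one looks damaged.
```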


Upvotes: 0

Yang HG

Reputation: 720

Because the regex is greedy, you need to exclude Notes: from the description first and then match the Notes field that follows. The ? quantifier should also be applied only once, to the Notes part as a whole.

Here's my regex:

(?:^Description: (?P<description>(?:(?!^Notes: ).)+))(?:^Notes: (?P<note>.+))?

You can test it here:

https://regex101.com/r/CgK1VH/1
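
A rough sketch of running this pattern in Python follows. It assumes the multiline and dot-all flags (`re.MULTILINE | re.DOTALL`): the pattern uses `^` in the middle of the text and relies on `.` crossing newlines, so it appears to need both, but the flags are an assumption and are not stated in the answer. The sample records come from the question and each string holds a single record, because with dot-all enabled the captures run to the end of the input.

```python
import re

# Pattern from the answer above. The flags are an assumption: MULTILINE so
# that ^ matches at each line start, DOTALL so that . also matches newlines.
PATTERN = re.compile(
    r"(?:^Description: (?P<description>(?:(?!^Notes: ).)+))"
    r"(?:^Notes: (?P<note>.+))?",
    re.MULTILINE | re.DOTALL,
)

# Sample records copied from the question, one record per string.
RED_CAR = (
    "Title: Red car\n"
    "Date: 2021-02-10\n"
    "\n"
    "Description: This car is very red.\n"
    "It goes very fast.\n"
    "\n"
    "There are many like it but this one is mine.\n"
)
BLUE_TRUCK = (
    "Title: Blue truck\n"
    "Date: 2021-02-11\n"
    "\n"
    "Description: The truck is blue.\n"
    "It carries a lot of stuff.\n"
    "\n"
    "Notes: This one looks damaged.\n"
)

for record in (RED_CAR, BLUE_TRUCK):
    match = PATTERN.search(record)
    description = match.group("description").strip()
    # The group is named "note" (singular) in this pattern; it is None
    # when the record has no Notes field.
    note = match.group("note")
    print(repr(description), repr(note.strip() if note else None))
```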

But, to be honest, I wouldn't suggest doing this with a regex, especially if the text file is large. The matching will be very slow.

Using file.readlines and line.startswith is much better.
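
One way the line-based approach could look, as a sketch only: the `parse_record` function and the `KNOWN_KEYS` tuple are hypothetical names, and the set of field labels is assumed from the two sample records in the question. Any line that does not start with a known label is treated as a continuation of the current field.

```python
# Field labels assumed from the sample records in the question.
KNOWN_KEYS = ("Title", "Date", "Description", "Notes")


def parse_record(lines):
    """Hypothetical helper: parse the lines of ONE record into a dict.

    A line starting with "<Key>:" opens a new field; any other line is
    appended to the field that is currently open.
    """
    fields = {}
    current = None
    for line in lines:
        line = line.rstrip("\r\n")
        for key in KNOWN_KEYS:
            if line.startswith(key + ":"):
                current = key
                fields[current] = line[len(key) + 1:].strip()
                break
        else:
            # No known label on this line: continue the open field, if any.
            if current is not None:
                fields[current] += "\n" + line
    # Drop blank lines that were collected at the end of each field.
    return {key: value.strip() for key, value in fields.items()}


# Sample record copied from the question; splitlines(keepends=True)
# yields the same kind of list that file.readlines() would return.
BLUE_TRUCK = (
    "Title: Blue truck\n"
    "Date: 2021-02-11\n"
    "\n"
    "Description: The truck is blue.\n"
    "It carries a lot of stuff.\n"
    "\n"
    "Notes: This one looks damaged.\n"
)

record = parse_record(BLUE_TRUCK.splitlines(keepends=True))
print(record.get("Notes"))  # This one looks damaged.
```

A record without a Notes line simply has no "Notes" key, so `record.get("Notes")` returns None. A file holding several records would first need to be split into records, for example at each line that starts with "Title:".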

Upvotes: 2
