Reputation: 10033

Regex matching an optional part of a document?

I have a plain text document that contains various freeform records that look like one of these two:

Title: Red car
Date: 2021-02-10
    
Description: This car is very red.
It goes very fast.

There are many like it but this one is mine.

Second:

Title: Blue truck
Date: 2021-02-11
    
Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.

I'm trying to use a regex in Python 3 with named groups to capture the fields. The "Notes" field is optional. The closest I've gotten is:

(?:Description:)(?P<description>.+?)\n\n(?:Notes:)?(?P<notes>.+)?

But it's still capturing text into "notes" even when the word "Notes:" doesn't appear in the document. Any suggestions?

Upvotes: 0

Answers (2)

Reputation: 163217

You could first match description followed by all lines that do not start with Notes or Title. Then optionally match the line with notes.

Description:(?P<description>.+(?:\r?\n(?!(?:Notes|Title):).*)*)(?:\s*\r?\n(?P<notes>Notes:.*))?

The pattern matches:

Description: Match literally
(?P<description> Named group description
- .+(?:\r?\n(?!(?:Notes|Title):).*)* Match the rest of the line, then all following lines that do not start with Notes: or Title:
) Close group
(?: Non capture group
- \s*\r?\n Match optional whitespace chars and a newline
- (?P<notes>Notes:.*) Named group notes, match the whole line
)? Close group and make it optional
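
As a rough illustration of the breakdown above, here is a minimal sketch that runs the pattern with re.finditer over a document holding both sample records. The helper name extract and the sample text are assumptions made for this example, not part of the answer. Note that the description group keeps its surrounding whitespace and the notes group includes the "Notes:" label, so the sketch strips the whitespace.

```python
import re

# The pattern from the answer, split over two lines for readability.
PATTERN = re.compile(
    r"Description:(?P<description>.+(?:\r?\n(?!(?:Notes|Title):).*)*)"
    r"(?:\s*\r?\n(?P<notes>Notes:.*))?"
)

# Hypothetical sample document built from the two records in the question.
SAMPLE = """Title: Red car
Date: 2021-02-10

Description: This car is very red.
It goes very fast.

There are many like it but this one is mine.

Title: Blue truck
Date: 2021-02-11

Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.
"""


def extract(text):
    """Return one dict per Description block; notes is None when absent."""
    records = []
    for m in PATTERN.finditer(text):
        notes = m.group("notes")
        records.append({
            "description": m.group("description").strip(),
            "notes": notes.strip() if notes is not None else None,
        })
    return records


for record in extract(SAMPLE):
    print(record)
```

Because the notes group sits inside an optional non-capturing group, m.group("notes") is None for the first record rather than an empty string.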

Upvotes: 0

Reputation: 720

Because regex matching is greedy, you need to exclude "Notes:" from the description first and then match the Notes part. Also, the ? quantifier should be applied only once, to the Notes part as a whole.

Here's my regex expression:

(?:^Description: (?P<description>(?:(?!^Notes: ).)+))(?:^Notes: (?P<note>.+))?
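
A minimal sketch of how this expression might be used from Python. It assumes the pattern is compiled with re.MULTILINE (so that ^ matches at line starts) and re.DOTALL (so that . crosses newlines), since the pattern does not appear to match the sample records otherwise. It also assumes each record is matched on its own, because with those flags the description would run on into any following record. The helper name split_record is hypothetical.

```python
import re

# The expression above, with the flags this sketch assumes it needs.
PATTERN = re.compile(
    r"(?:^Description: (?P<description>(?:(?!^Notes: ).)+))(?:^Notes: (?P<note>.+))?",
    re.MULTILINE | re.DOTALL,
)

RED = """Title: Red car
Date: 2021-02-10

Description: This car is very red.
It goes very fast.

There are many like it but this one is mine.
"""

BLUE = """Title: Blue truck
Date: 2021-02-11

Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.
"""


def split_record(record):
    """Return (description, note) for a single record, or None if no match."""
    m = PATTERN.search(record)
    if m is None:
        return None
    note = m.group("note")
    return (
        m.group("description").strip(),
        note.strip() if note is not None else None,
    )


print(split_record(RED))
print(split_record(BLUE))
```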

To be honest, though, I don't recommend doing this with a regex, especially if the text file is large, because the matching will be very slow.

Reading the file with file.readlines and checking each line with line.startswith is much better.
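
One way the line-based approach could look, as a sketch only. It assumes every record begins with a "Title:" line and that any line without a known field prefix continues the previous field. The names FIELDS and parse_records are made up for this example.

```python
# Field labels assumed from the sample records in the question.
FIELDS = ("Title", "Date", "Description", "Notes")


def parse_records(lines):
    """Group lines into dicts keyed by lowercase field name."""
    records = []
    current = None  # record being filled
    field = None    # field that continuation lines belong to
    for raw in lines:
        line = raw.rstrip("\r\n")
        for name in FIELDS:
            prefix = name + ":"
            if line.startswith(prefix):
                # A Title line (or the very first field seen) opens a new record.
                if name == "Title" or current is None:
                    current = {}
                    records.append(current)
                field = name.lower()
                current[field] = line[len(prefix):].strip()
                break
        else:
            # No known prefix: treat the line as a continuation of the last field.
            if current is not None and field is not None:
                current[field] += "\n" + line
    # Trim blank lines picked up at the end of each field.
    return [{k: v.strip() for k, v in rec.items()} for rec in records]


sample = """Title: Red car
Date: 2021-02-10

Description: This car is very red.
It goes very fast.

There are many like it but this one is mine.

Title: Blue truck
Date: 2021-02-11

Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.
"""

for rec in parse_records(sample.splitlines(keepends=True)):
    print(rec.get("title"), "|", rec.get("notes"))
```

With a real file, the same function can take the result of file.readlines(), or the open file object itself, since it only iterates over lines. A record without notes simply has no "notes" key, so rec.get("notes") returns None.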

Upvotes: 2