Reputation: 10033
I have a plain text document that contains various freeform records that look like one of these two:
Title: Red car
Date: 2021-02-10
Description: This car is very red.
It goes very fast.
There are many like it but this one is mine.
Second:
Title: Blue truck
Date: 2021-02-11
Description: The truck is blue.
It carries a lot of stuff.
Notes: This one looks damaged.
I'm trying to use a regex in Python 3 with named groups to capture the fields. The "Notes" field is optional. The closest I've gotten is:
(?:Description:)(?P<description>.+?)\n\n(?:Notes:)?(?P<notes>.+)?
But it's still capturing text into "notes" even when the word "Notes:" doesn't appear in the document. Any suggestions?
Upvotes: 0
Views: 72
Reputation: 163217
You could first match description followed by all lines that do not start with Notes or Title. Then optionally match the line with notes.
Description:(?P<description>.+(?:\r?\n(?!(?:Notes|Title):).*)*)(?:\s*\r?\n(?P<notes>Notes:.*))?
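As a rough sketch, this is how the pattern could be tried from Python. The sample text is an assumption based on the question (records separated by a blank line, with a blank line before Notes:), and the names PATTERN, SAMPLE and records are made up for illustration. Note that the notes group captures the whole line, including the Notes: label.

```python
import re

# The pattern from this answer, split over two string literals for readability.
PATTERN = re.compile(
    r"Description:(?P<description>.+(?:\r?\n(?!(?:Notes|Title):).*)*)"
    r"(?:\s*\r?\n(?P<notes>Notes:.*))?"
)

# Assumed sample input: blank lines between records and before "Notes:".
SAMPLE = """\
Title: Red car
Date: 2021-02-10
Description: This car is very red.
It goes very fast.
There are many like it but this one is mine.

Title: Blue truck
Date: 2021-02-11
Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.
"""

records = [
    {
        # strip() drops the space after "Description:" and trailing newlines
        "description": m.group("description").strip(),
        # None when the record has no Notes: line
        "notes": m.group("notes"),
    }
    for m in PATTERN.finditer(SAMPLE)
]

for record in records:
    print(record)
```

With this input the first record should come back with notes set to None, and the second with the full Notes: line.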
The pattern matches:
Description:
Match literally
(?P<description>
Named group description
.+(?:\r?\n(?!(?:Notes|Title):).*)*
Match all lines that do not start with Notes: and Title:
)
Close group
(?:
Non capture group
\s*\r?\n
Match optional whitespace chars and a newline
(?P<notes>Notes:.*)
Named group notes, match the whole line
)?
Close group and make it optional
Upvotes: 0
Reputation: 720
Because the quantifiers are greedy, the description part has to exclude Notes: first; the Notes: part can then be matched after it. The ? operator should also be specified only once.
Here's my regex:
(?:^Description: (?P<description>(?:(?!^Notes: ).)+))(?:^Notes: (?P<note>.+))?
Please test here:
https://regex101.com/r/CgK1VH/1
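A minimal sketch of using this pattern from Python, assuming the MULTILINE and DOTALL flags: the pattern needs the first so that ^ can match at the start of the Description: and Notes: lines, and the second so that . can cross line breaks in a multi-line description. The helper parse_record and the sample strings are hypothetical, and each string is assumed to hold a single record.

```python
import re

# Assumed flags: MULTILINE so ^ matches at every line start,
# DOTALL so . also matches newlines inside the description.
PATTERN = re.compile(
    r"(?:^Description: (?P<description>(?:(?!^Notes: ).)+))"
    r"(?:^Notes: (?P<note>.+))?",
    re.MULTILINE | re.DOTALL,
)


def parse_record(text):
    """Return the description and optional note of one record (hypothetical helper)."""
    m = PATTERN.search(text)
    if m is None:
        return None
    note = m.group("note")
    return {
        "description": m.group("description").strip(),
        "note": note.strip() if note is not None else None,
    }


with_notes = (
    "Title: Blue truck\n"
    "Date: 2021-02-11\n"
    "Description: The truck is blue.\n"
    "It carries a lot of stuff.\n"
    "\n"
    "Notes: This one looks damaged.\n"
)
without_notes = (
    "Title: Red car\n"
    "Date: 2021-02-10\n"
    "Description: This car is very red.\n"
    "It goes very fast.\n"
)

truck = parse_record(with_notes)
car = parse_record(without_notes)
print(truck)
print(car)
```

One caveat: with DOTALL the description runs until the next line starting with Notes: or the end of the input, so on a document with several records the text would need to be split into records first.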
But to be honest, I don't recommend doing this with a regex, especially if the text file is large, because matching can become very slow.
Reading the lines with file.readlines and checking each one with line.startswith is a much better approach.
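The line-based approach might look something like the sketch below. It assumes every record starts with a Title: line and that any line without a known label continues the previous field; the names FIELDS and parse_records are made up for illustration.

```python
FIELDS = ("Title", "Date", "Description", "Notes")


def parse_records(lines):
    """Group labelled lines into one dict per record (hypothetical helper)."""
    records = []
    current = None  # record being filled
    field = None    # key of the field that continuation lines belong to
    for raw in lines:
        line = raw.rstrip("\r\n")
        label = next((f for f in FIELDS if line.startswith(f + ":")), None)
        if label == "Title":
            # Assumption: a Title: line always opens a new record.
            current = {}
            records.append(current)
        if label is not None and current is not None:
            field = label.lower()
            current[field] = line[len(label) + 1:].strip()
        elif current is not None and field is not None and line.strip():
            # Unlabelled, non-blank line: continue the previous field.
            current[field] += "\n" + line.strip()
    return records


SAMPLE = """\
Title: Red car
Date: 2021-02-10
Description: This car is very red.
It goes very fast.
There are many like it but this one is mine.

Title: Blue truck
Date: 2021-02-11
Description: The truck is blue.
It carries a lot of stuff.

Notes: This one looks damaged.
"""

records = parse_records(SAMPLE.splitlines(keepends=True))
for record in records:
    print(record.get("title"), "|", record.get("notes"))
```

For a real file, the list returned by file.readlines() (or the open file object itself) can be passed in place of the sample lines. Records without a Notes: line simply have no "notes" key.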
Upvotes: 2