Andrew Tobey

Reputation: 923

python develop non-greedy regex to match specific pattern several times

I am about to develop a regex for a pattern given in a file I want to process.

The file contains several articles, which all follow a similar pattern:

  1. start with an empty line, i.e. a newline
  2. then have some non-word characters on a line followed by "Dokument xx von xx" and a newline
  3. that is followed by a body of characters
  4. ends with two newlines, followed by a line with non-word characters followed by "Copyright" followed by more characters and a new line
  5. one optional line containing non-word characters followed by more characters and a new line
  6. finally one line containing non-word characters followed by either "All Rights Reserved" or "Alle Rechte vorbehalten" and a new line

I am trying to come up with a non-greedy regex that accurately matches the start, body, and end of each article.

For 1-4 I have ^\n\W+Dokument.+?[\r\n][\r\n]\W+Copyright[^\n]+\n

What is necessary for 5-6?

Do I actually need the DOTALL flag if I want to use this regex to match the pattern several times in one file?

I have been on this all day. Can someone with a fresh mind show me the missing bits?

Cheers, Andrew

Upvotes: 0

Views: 66

Answers (1)

karthik manchala

Reputation: 13650

You can use the following for steps 5 and 6:

  5. one optional line containing non-word characters followed by more characters and a new line
(\W+?(?:(?!All|Alle).)+?\n)?
  6. one line containing non-word characters followed by either "All Rights Reserved" or "Alle Rechte vorbehalten" and a new line
\W+(All Rights Reserved|Alle Rechte vorbehalten)\n
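To see how the two pieces behave together, here is a small sketch in Python. The sample strings and the names OPTIONAL_LINE, RIGHTS_LINE and tail are made up for illustration; only the two patterns come from above. The negative lookahead (?!All|Alle) is what keeps the optional line from swallowing the rights line.

```python
import re

# The two sub-patterns from above, unchanged.
OPTIONAL_LINE = r"(\W+?(?:(?!All|Alle).)+?\n)?"
RIGHTS_LINE = r"\W+(All Rights Reserved|Alle Rechte vorbehalten)\n"

# Hypothetical helper: both pieces glued together for a quick check.
tail = re.compile(OPTIONAL_LINE + RIGHTS_LINE)

# Made-up article endings, one with and one without the optional line.
with_optional = "   Some extra line\n   All Rights Reserved\n"
without_optional = "   Alle Rechte vorbehalten\n"

m1 = tail.match(with_optional)
m2 = tail.match(without_optional)

# Group 1 is the optional line (None when it is absent),
# group 2 is the rights notice that was found.
print(repr(m1.group(1)), repr(m1.group(2)))
print(repr(m2.group(1)), repr(m2.group(2)))
```

When the optional line is missing, group 1 is simply None and the rights line still matches.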

Combining 1-6:

^\W+Dokument.+?[\r\n][\r\n]\W+Copyright[^\n]+\n(\W+?(?:(?!All|Alle).)+?\n)?\W+?(?:All Rights Reserved|Alle Rechte vorbehalten)\n

See DEMO
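As for the flags: the following is a minimal sketch of how the combined pattern could be applied in Python, assuming the body spans several lines and the articles sit one after another in the same string. SAMPLE is invented test data and ARTICLE_RE is just a name chosen here. Under those assumptions re.DOTALL is needed so that .+? can run across the line breaks in the body, and re.MULTILINE lets ^ anchor at the start of every article rather than only at the start of the file.

```python
import re

# Combined pattern from the answer, split over three lines for readability.
ARTICLE_RE = re.compile(
    r"^\W+Dokument.+?[\r\n][\r\n]\W+Copyright[^\n]+\n"
    r"(\W+?(?:(?!All|Alle).)+?\n)?"
    r"\W+?(?:All Rights Reserved|Alle Rechte vorbehalten)\n",
    re.MULTILINE | re.DOTALL,
)

# Invented sample: two articles, the second one has the optional line.
SAMPLE = (
    "\n"
    "   Dokument 1 von 2\n"
    "First article body.\n"
    "Second body line.\n"
    "\n"
    "   Copyright 2015 Example Publisher\n"
    "   All Rights Reserved\n"
    "\n"
    "   Dokument 2 von 2\n"
    "Zweiter Artikel.\n"
    "\n"
    "   Copyright 2015 Beispiel Verlag\n"
    "   Eine optionale Zeile\n"
    "   Alle Rechte vorbehalten\n"
)

# finditer walks the string and yields one match per article.
articles = [m.group(0) for m in ARTICLE_RE.finditer(SAMPLE)]
print(len(articles))

# Same pattern without DOTALL: '.' stops at the first newline,
# so the multi-line body can no longer be bridged.
no_dotall = re.compile(ARTICLE_RE.pattern, re.MULTILINE)
print(no_dotall.findall(SAMPLE))
```

Because .+? is non-greedy, each match stops at the first copyright footer it can reach, which is what keeps consecutive articles from being merged into one match.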

Upvotes: 1
