Python: develop a non-greedy regex to match a specific pattern several times

Question

I am about to develop a regex for a pattern given in a file I want to process.

The file contains several articles, which all follow a similar pattern:

1. starts with a line break, i.e. a newline
2. then has some non-word characters on a line, followed by "Dokument xx von xx" and a newline
3. that is followed by a body of characters
4. ends with two newlines, followed by a line with non-word characters followed by "Copyright" followed by more characters and a new line
5. one optional line containing non-word characters followed by more characters and a new line
6. finally one line containing non-word characters followed by either "All Rights Reserved" or "Alle Rechte vorbehalten" and a new line

I am trying to come up with a non-greedy regex that accurately matches the start, body, and end of the article(s).

For 1-4 I have ^\n\W+Dokument.+?[\n][\n]\W+Copyright[^\n]+

What is necessary for 5-6?

Do I actually need a dotall flag if I aim to use this regex as proposed to match the pattern several times in a file?

I have been on this all day. Can someone with a fresh mind show me the missing bits?

Cheers, Andrew

karthik manchala · Accepted Answer

You can use the following:

one optional line containing non-word characters followed by more characters and a new line

(\W+?(?:(?!All|Alle).)+?\n)?
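To see what this optional-line group does on its own, here is a minimal sketch. The two sample lines are invented for illustration and are not taken from the asker's file. The negative lookahead (?!All|Alle) is checked before every character, so the group refuses to swallow the final "All Rights Reserved" line.

```python
import re

# Optional-line sub-pattern from the answer, with the newline escape written out.
optional_line = re.compile(r"(\W+?(?:(?!All|Alle).)+?\n)?")

# Hypothetical sample lines, invented for this illustration.
extra = "--- Example Zeitung\n"
final = "--- All Rights Reserved\n"

# The whole pattern is optional, so match() always succeeds;
# group(1) is None when the optional line was not consumed.
extra_match = optional_line.match(extra).group(1)
final_match = optional_line.match(final).group(1)

print(repr(extra_match))
print(repr(final_match))
```

One caveat: because the lookahead applies at every position, an optional line that contains "All" anywhere (for example a publisher name starting with "Allgemeine") would also be skipped by this group.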

one line containing non-word characters followed by either "All Rights Reserved" or "Alle Rechte vorbehalten" and a new line

\W+(All Rights Reserved|Alle Rechte vorbehalten)

Combining 1-6:

^\W+Dokument.+?[\n][\n]\W+Copyright[^\n]+\n(\W+?(?:(?!All|Alle).)+?\n)?\W+?(?:All Rights Reserved|Alle Rechte vorbehalten)
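On the flags question: the body is matched by .+?, which only crosses line breaks when re.DOTALL is set, and ^ only anchors at the start of each line (rather than only at the start of the file) when re.MULTILINE is set. A minimal sketch of matching several articles with re.finditer follows. The sample text and the names PATTERN, ARTICLE, and sample are made up for illustration, so treat this as a starting point rather than a tested solution for the real file.

```python
import re

# Combined pattern from the answer, split over three string literals for readability.
PATTERN = (
    r"^\W+Dokument.+?[\n][\n]\W+Copyright[^\n]+\n"
    r"(\W+?(?:(?!All|Alle).)+?\n)?"
    r"\W+?(?:All Rights Reserved|Alle Rechte vorbehalten)"
)

# MULTILINE: ^ matches at every line start. DOTALL: . also matches newlines.
ARTICLE = re.compile(PATTERN, re.MULTILINE | re.DOTALL)

# Hypothetical two-article sample; the real file layout is an assumption.
sample = (
    "\n"
    "--- Dokument 1 von 2\n"
    "First body line.\n"
    "Second body line.\n"
    "\n"
    "--- Copyright 2015 Example Verlag\n"
    "--- Example Zeitung\n"
    "--- All Rights Reserved\n"
    "\n"
    "--- Dokument 2 von 2\n"
    "Zweiter Artikel.\n"
    "\n"
    "--- Copyright 2015 Beispiel GmbH\n"
    "--- Alle Rechte vorbehalten\n"
)

matches = list(ARTICLE.finditer(sample))
articles = [m.group(0) for m in matches]        # full text of each article
optional_lines = [m.group(1) for m in matches]  # None if the optional line is absent

# Same pattern without DOTALL: the multi-line body can no longer be crossed.
no_dotall = re.search(PATTERN, sample, re.MULTILINE)

print(len(articles))
print(optional_lines)
print(no_dotall)
```

Since the leading \W+ also matches newlines, each match here begins with the blank line that precedes the "Dokument" line.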
