Python: regex to recognize document header

Question

I have parsed a document into separate sentences, but some of the parsed sentences also contain the document's headers. This means a few sentences look like this:

Blah bla blah some text . 2 Year 2011 Company name , Company disclaimer Date 5 january 2011 Blah blah blah text continues .

Now I want to remove the headers (if present) and split the string in two: one sentence before the header and one sentence after it.

The dates in the header differ, but the header always:

starts with a page number, followed by 'Year' and the year's number;
ends with: 'Date' + (int) + (string) + (int).

Would there be a regular expression to recognize this header and delete it?

m.cekiera · Accepted Answer

Try with:

\d+\sYear\s\d{4}[\w\s,]+?Date\s\d+\s\w+\s\d{4}


However, this depends on the text content: other fragments of the document could also match the pattern, so a longer example may be needed to refine it.
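To both delete the header and split the string in two, the pattern can be passed to `re.split`. The following is only a minimal sketch: the function name `split_on_header` and the variable names are made up for illustration, and it assumes every header follows the format described in the question.

```python
import re

# Pattern from the answer: page number, 'Year' + 4-digit year,
# a lazily matched header body, then 'Date' + day + month + 4-digit year.
HEADER_RE = re.compile(r"\d+\sYear\s\d{4}[\w\s,]+?Date\s\d+\s\w+\s\d{4}")


def split_on_header(sentence):
    """Split a sentence around any header and drop the header itself.

    Hypothetical helper: returns a list with the non-empty text parts,
    or a one-element list when no header is found.
    """
    parts = HEADER_RE.split(sentence)
    return [part.strip() for part in parts if part.strip()]


text = ("Blah bla blah some text . 2 Year 2011 Company name , "
        "Company disclaimer Date 5 january 2011 Blah blah blah text continues .")

for part in split_on_header(text):
    print(part)
```

A sentence without a header comes back unchanged as a single-element list, so the helper can be applied to every parsed sentence. As noted above, ordinary text that happens to look like a header would also be removed.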
