Reputation: 3601
I have parsed a document into separate sentences, but some of the parsed sentences also contain the document's headers. This means a few sentences look like this:
Blah bla blah some text . 2 Year 2011 Company name , Company disclaimer Date 5 january 2011 Blah blah blah text continues .
Now I want to remove the headers (if present) and split the string in two (one sentence before the header, the other sentence after the header).
The dates in the header differ, but it always...
Would there be a regular expression to recognize this header and delete it?
Upvotes: 0
Views: 442
Reputation: 5395
Try with:
\d+\sYear\s\d{4}[\w\s,]+?Date\s\d+\s\w+\s\d{4}
However, this depends on the text content: other fragments could also match. A longer example may be needed to refine it.
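As a minimal sketch of how the pattern above could be used to get the two sentences the question asks for, `re.split` can cut the string at the header. The function name `split_on_header` is made up for illustration, and the code assumes every header looks like the single sample in the question:

```python
import re

# Pattern from this answer; assumes headers follow the sample's layout.
HEADER = re.compile(r"\d+\sYear\s\d{4}[\w\s,]+?Date\s\d+\s\w+\s\d{4}")

def split_on_header(sentence):
    """Split at the first header; return the non-empty pieces, stripped."""
    parts = HEADER.split(sentence, maxsplit=1)
    return [p.strip() for p in parts if p.strip()]

sample = ("Blah bla blah some text . 2 Year 2011 Company name , "
          "Company disclaimer Date 5 january 2011 Blah blah blah text continues .")
print(split_on_header(sample))
```

A sentence without a header comes back as a one-element list, so the same call can be applied to every parsed sentence.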
Upvotes: 1
Reputation: 1249
You can use re.sub, providing an empty string as the repl parameter.
re.sub(r"\d+ Year \d{4}.*?Date \d{1,2} (january|february) \d{4}", "", your_sentence)
Take a look at the re.sub documentation for more details.
You can also use Pythex to test regular expression patterns.
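A slightly fuller sketch of the re.sub approach, in case it helps. The month list, the case-insensitive flag, and the helper name `strip_header` are my own assumptions and not something stated in the question. Note that the flag also makes "Year" and "Date" match in any case:

```python
import re

# Assumption: headers spell out full English month names.
MONTHS = ("january|february|march|april|may|june|july|"
          "august|september|october|november|december")

# Non-greedy ".*?" stops at the first "Date ..." that follows "Year ...".
HEADER = re.compile(
    r"\d+ Year \d{4}.*?Date \d{1,2} (?:" + MONTHS + r") \d{4}",
    re.IGNORECASE,
)

def strip_header(sentence):
    """Delete any header, then collapse the leftover double spaces."""
    cleaned = HEADER.sub("", sentence)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

your_sentence = ("Blah bla blah some text . 2 Year 2011 Company name , "
                 "Company disclaimer Date 5 january 2011 Blah blah blah text continues .")
print(strip_header(your_sentence))
```

This only deletes the header and leaves one string. If the two halves are needed separately, the compiled pattern's `split` method can be used in place of `sub`.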
Upvotes: 1