Reputation: 1038
I have a running text with many sentences. I have a regular expression that extracts the sentences terminated by a period, question mark, or exclamation mark. The end of a sentence must be followed by the beginning of the next sentence (whitespace, tabs, or newlines, then a capital letter or a number). I read the text into a string stored in data and call the regex:
import re
basic_pat = re.compile(r"[(']?\w.+[)']?[?.!](?=\s+[A-Z\d])")
result = basic_pat.findall(data)
This regex seems to work as long as we ignore abbreviations. The text may also contain chapter titles that do not end with a period. For example:
This is the first chapter
Here is the first sentence. Here is the second sentence.Here ids the third sent. Here is the fourth sent...
My question is whether it is possible to have one regex that matches only the chapter titles and another that matches the sentences. A chapter title is loose text on a line of its own, without a period. Regular sentences may span several lines, so a sentence can also contain a line that does not end with a period. Is it possible to distinguish the two cases (chapter title vs. sentence) with a regex?
Upvotes: 0
Views: 280
Reputation: 44436
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Actually, what you should do is use two regular expressions (now you'll have four problems).
First, go through and break up the text into alternating chapter headings and non-heading blocks. Then examine each non-heading block for sentences, paragraphs, and what have you.
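A rough sketch of that two-pass idea in Python might look like the following. The heuristics are assumptions of this sketch, not something the question guarantees: a line counts as a chapter heading only if the previous line closed a sentence, the line itself has no closing punctuation, and the next line starts with a capital letter or digit. The names split_blocks and split_sentences are made up for illustration, and the sentence pattern is a non-greedy variation on the one in the question rather than the asker's exact regex.

```python
import re

# Assumption: a line "closes" a sentence if it ends with . ? or !,
# optionally followed by a closing parenthesis or quote.
ENDS_SENTENCE = re.compile(r"[.?!][)']?\s*$")
# Assumption: a new sentence or heading starts with a capital letter or digit.
STARTS_NEW = re.compile(r"[(']?[A-Z\d]")
# Shortest run ending in a terminator that is followed by whitespace plus
# a capital letter/digit, or by the end of the block. DOTALL lets a
# sentence span several lines.
SENTENCE = re.compile(r"\S.*?[.?!][)']?(?=\s+[(']?[A-Z\d]|\s*$)", re.DOTALL)


def split_blocks(text):
    """Pass 1: return a list of ("heading" | "body", text) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    blocks = []
    prev_closed = True  # the start of the text counts as a closed sentence
    for i, line in enumerate(lines):
        is_last = i + 1 == len(lines)
        next_starts_new = is_last or bool(STARTS_NEW.match(lines[i + 1]))
        closed = bool(ENDS_SENTENCE.search(line))
        if prev_closed and not closed and next_starts_new:
            blocks.append(("heading", line))
            prev_closed = True  # a heading does not leave a sentence open
        else:
            if blocks and blocks[-1][0] == "body":
                # Consecutive body lines are merged into one block.
                blocks[-1] = ("body", blocks[-1][1] + "\n" + line)
            else:
                blocks.append(("body", line))
            prev_closed = closed
    return blocks


def split_sentences(body):
    """Pass 2: extract sentences from a body block, collapsing whitespace."""
    return [" ".join(m.split()) for m in SENTENCE.findall(body)]


sample = (
    "This is the first chapter\n"
    "Here is the first sentence. Here is the second sentence."
    "Here ids the third sent. Here is the fourth sent..."
)
for kind, block in split_blocks(sample):
    if kind == "heading":
        print("HEADING:", block)
    else:
        for sentence in split_sentences(block):
            print("SENTENCE:", sentence)
```

These rules are easy to fool. A heading that happens to end in an abbreviation such as "St." is treated as body text, and the first line of a multi-line sentence is treated as a heading whenever its continuation line begins with a capital letter.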
How would you break up the following:
Visiting Leipzig, Chapter One: Thomaskirchhof St.
The Bach Museum is on Thomaskirchhof opposite St. Thomas's Church. van Beethoven doesn't have a museum anywhere in Leipzig.
Processing natural language is devilishly difficult. God did a thorough job when He destroyed the Tower of Babel.
Upvotes: 3