Reputation: 143
I have a simple tokenizer, shown below, which works well for the test files I need to run it on:
import re, sys

for line in sys.stdin:
    # dotted abbreviations | hyphen/dot compounds | dashes | possessive 's | punctuation | plain words
    for token in re.findall(r"(\w+\.\w+\.[\w.]*|\w+[-.]\w+|[-]+|'s|[,;:.!?\"%']|\w+)", line.strip()):
        print(token)
Text like "This house is small. That house is big." is correctly turned into:
This
house
is
small
.
That
house
is
big
.
However, I also need to insert a blank line between sentences:
···
small
.

That
···
So I've written another loop,

for token in re.sub("([\"\.!?])\s([\"`]+|[A-Z]+\w*)", "\\1\n\n\\2", line):

with a regexp which catches almost all sentence breaks in the test texts that I need to use, but I'm having trouble actually connecting it to the code above. Putting it inside the first for loop, which feels most logical to me, breaks the output completely. I also tried some if clauses, but that doesn't work either.
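Concretely, nesting it inside the first loop gives something like this (a rough sketch of the failing attempt; note that re.sub() returns a plain string, so the innermost for iterates over it one character at a time):

import re, sys

for line in sys.stdin:
    for token in re.findall(r"(\w+\.\w+\.[\w.]*|\w+[-.]\w+|[-]+|'s|[,;:.!?\"%']|\w+)", line.strip()):
        # re.sub() returns a string; looping over it yields single characters,
        # and the pattern can't match inside a token anyway (tokens contain no whitespace)
        for part in re.sub("([\"\.!?])\s([\"`]+|[A-Z]+\w*)", "\\1\n\n\\2", token):
            print(part)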
Upvotes: 1
Views: 860
Reputation: 19289
DetectorMorse is an open-source sentence segmenter by Kyle Gorman with state-of-the-art performance on formal business English sentences (WSJ articles). It uses simple regexes as an initial filter but then deals with the remaining 10% of difficult cases with a single-layer perceptron. So it can be trained to perform well in domains other than WSJ English.
Sentence boundary detection (and segmentation) is an area of active research and continual refinement. I don't think there exists a regular expression that can reliably detect sentences and sentence boundaries. In addition, regular expressions can't easily tell you how "confident" they are in a sentence boundary, and they can't be retrained on a new vocabulary, language, dialect, or writing style. A few constructions that would break many regular expressions are sketched below.
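For instance, here is a minimal illustration with a naive boundary pattern (invented for this example, not any particular library's rule):

import re

# naive rule: a sentence ends at ., ! or ? followed by whitespace and a capital
naive = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = 'Dr. Smith paid $3.50. "Wow!" he said.'
for sentence in naive.split(text):
    print(repr(sentence))
# The abbreviation "Dr." triggers a false break, while the real boundary
# after '$3.50.' is missed because the next character is a quote, not a capital.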
And this doesn't even begin to address informal English or other grammars and language varieties: creoles, chat messages, urban slang, and so on.
English (or any natural language) is an empirically defined language (or "historically defined") where grammar and punctuation rules depend on the experience of the humans doing the communicating. And this experience history "time window" is adjustable based on context, geographic location, and even individual "theories of mind" about the audience/reader. Even children develop their own "secret" languages at an early age. Humans make and break and evolve the rules of their language according to the people they communicate with in a particular domain, geographic region, etc.
So if accuracy is important to you, the state of the art in sentence segmentation must likewise be "fuzzy" and empirically defined (e.g. machine learning) within your domain, using a set of training examples from "your world".
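As one concrete option, NLTK's Punkt tokenizer is such an empirically trained segmenter: it learns abbreviations and boundary statistics from raw in-domain text without supervision (a sketch; "my_corpus.txt" is a placeholder for your own training text):

from nltk.tokenize.punkt import PunktSentenceTokenizer

# train unsupervised on raw text from your own domain (placeholder file name)
with open("my_corpus.txt") as f:
    tokenizer = PunktSentenceTokenizer(f.read())

for sentence in tokenizer.tokenize("Dr. Smith is here. He looks tired."):
    print(sentence)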
Upvotes: 0
Reputation: 474221
Non-regex solution using a combination of sent_tokenize() and word_tokenize() from nltk:
from nltk.tokenize import word_tokenize, sent_tokenize

s = "This house is small. That house is big."
for t in sent_tokenize(s):
    for word in word_tokenize(t):
        print(word)
    print()  # blank line between sentences
Prints:
This
house
is
small
.

That
house
is
big
.
Upvotes: 3
Reputation: 1431
Here's a simpler approach that works for the example you gave. If the more complex regex is needed, it can be added back in:
import re

mystr = "This house is small. That house is big."
for token in re.findall(r"([\w]+|[^\s])", mystr):
    print(token)
    # a blank line after each sentence-final punctuation mark
    if re.match(r"[.!?]", token):
        print()
I'm not quite clear on how you expect to handle punctuation within sentences, and on which punctuation terminates a sentence, so it would likely have to be modified a little.
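For example, if the real texts contain abbreviations or decimal numbers, one possible tightening (an assumption about the desired behaviour, not part of the original answer) is to break only when the terminator is followed by a capitalized token:

import re

mystr = "This house is small. That house is big."
tokens = re.findall(r"([\w]+|[^\s])", mystr)
for i, token in enumerate(tokens):
    print(token)
    # break at ., ! or ? only when the next token starts with a capital
    # (or at the very end); this avoids breaking inside numbers like 3.50,
    # though it still can't distinguish abbreviations such as "Dr."
    if re.match(r"[.!?]$", token) and (i + 1 == len(tokens) or tokens[i + 1][:1].isupper()):
        print()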
Upvotes: 2