Reputation: 965
I am having a bit of trouble correctly identifying sentences in a text in specific corner cases, such as when " characters are involved.
This is how I identify sentences in text so far (source: Subtitles Reformat to end with complete sentence). The re.findall part basically looks for a chunk of str that starts with a capital letter, [A-Z], then anything except the punctuation, then ends with the punctuation, [\.?!].
import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
Corner Case 1: Dot, Dot, Dot
The dot, dot, dot is not kept, since the pattern gives no instructions for what to do if three dots appear in a row. How could this be changed?
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
Corner Case 2: "
The " symbol is successfully kept within a sentence, but like the dots following the punctuation, it is dropped at the end.
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?
Next, we also determined the size of the population.
Corner Case 3: lower case start of a sentence
If a sentence accidentally starts with a lowercase letter, the sentence is ignored. The aim would be to identify that a previous sentence ended (or the text just started) and hence a new sentence has to start.
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Edit: I tested the spaCy approach suggested in the answers:
import spacy
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
...but I get:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
Upvotes: 3
Views: 5104
Reputation: 579
Responding to the Edited question:
I think the code you are using is for an older version of spaCy. For spaCy 3.0 you need to download the en_core_web_sm model first:
python -m spacy download en_core_web_sm
then the following solution should work:
import spacy

raw_text = 'Hello, world. Here are two sentences.'
nlp = spacy.load("en_core_web_sm")  # model downloaded with the command above
doc = nlp(raw_text)
sentences = list(doc.sents)  # each item is a spaCy Span
print(sentences)
Output -
[Hello, world., Here are two sentences.]
Upvotes: 0
Reputation: 6099
You could modify your regex to match your corner cases.
First of all, you do not need to escape . inside [].
For the first corner case, you can greedily match the sentence-ending token with [.!?]*.
For the second, you can optionally match " after [.!?].
For the last one, you can start your sentence with either an upper or lower case letter:
import re

# Note: [A-z] also matches [ \ ] ^ _ ` (they sit between Z and a in ASCII);
# use [A-Za-z] to match letters only.
regex = r'([A-z][^.!?]*[.!?]*"?)'

# Corner case 1: three dots in a row
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

# Corner case 2: closing quote after the punctuation
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

# Corner case 3: sentence starting with a lowercase letter
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
[A-z]: every match should start with a letter, either upper or lower case. Note that [A-z] also matches the characters [ \ ] ^ _ ` which sit between Z and a in ASCII; [A-Za-z] matches letters only.
[^.?!]*: it greedily matches any character which is not ., ? or ! (a sentence-ending character).
[.?!]*: it greedily matches the ending characters, so ...??!!??? will be matched as part of the sentence.
"?: it optionally matches a quote at the end of the sentence.
Corner case 1:
We were able to respond to the first research question...
Next, we also determined the size of the population.
Corner case 2:
We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.
Corner case 3:
We were able to respond to the first research question.
next, we also determined the size of the population.
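The difference between [A-z] and [A-Za-z] can be checked directly. The following is only a minimal sketch: split_sentences is a hypothetical helper name (not part of the original answer), and it assumes the stricter [A-Za-z] start is what is wanted.

```python
import re

# Same pattern as in the answer, but starting with [A-Za-z] so that only
# letters can open a match (an assumption, not the original regex).
SENTENCE_RE = re.compile(r'[A-Za-z][^.!?]*[.!?]*"?')

def split_sentences(text):
    """Return the sentence-like chunks found in text (hypothetical helper)."""
    return SENTENCE_RE.findall(text)

# [A-z] covers ASCII 65..122, which includes [ \ ] ^ _ ` between 'Z' and 'a'.
print(re.findall(r'[A-z]', '[_^'))     # ['[', '_', '^']
print(re.findall(r'[A-Za-z]', '[_^'))  # []

cases = [
    "We were able to respond to the first research question... Next, we also determined the size of the population.",
    "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population.",
    "We were able to respond to the first research question. next, we also determined the size of the population.",
]
for text in cases:
    # Each case yields two chunks, one per sentence.
    print(split_sentences(text))
```

Keep in mind that any regex of this kind treats every ., ? or ! as a sentence end, so abbreviations such as "e.g." or decimal numbers will still be split.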
Upvotes: 5
Reputation: 2171
You can use one of the industrial-strength packages for that. For example, spaCy has a very good sentence tokenizer.
from __future__ import unicode_literals, print_function
from spacy.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
Your scenarios:
case 1 result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
case 2 result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
case 3 result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
Upvotes: 2
Reputation: 53
Try this regex: ([A-Z][^.!?]*[.!?]+["]?)
'+' means one or more
'?' means zero or one
This should handle corner cases 1 and 2 mentioned above. For corner case 3, the leading [A-Z] would also need to accept lowercase letters, e.g. [A-Za-z].
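A quick way to see what each quantifier accepts, and how this regex behaves on the lowercase corner case, is sketched below. find_sentences and its allow_lowercase_start flag are hypothetical names introduced only for this illustration.

```python
import re

# Quantifier semantics, checked against a literal dot.
assert re.fullmatch(r'\.+', '...')     # '+' : one or more
assert not re.fullmatch(r'\.+', '')    # ... so an empty string fails
assert re.fullmatch(r'\.?', '')        # '?' : zero or one
assert not re.fullmatch(r'\.?', '..')  # ... so two dots fail
assert re.fullmatch(r'\.*', '')        # '*' : zero or more

def find_sentences(text, allow_lowercase_start=False):
    """Hypothetical helper around the regex from this answer."""
    start = '[A-Za-z]' if allow_lowercase_start else '[A-Z]'
    return re.findall(start + r'[^.!?]*[.!?]+"?', text)

text3 = ("We were able to respond to the first research question. "
         "next, we also determined the size of the population.")

# With [A-Z], the sentence starting with a lowercase letter is skipped.
print(find_sentences(text3))
# With [A-Za-z], both sentences are returned.
print(find_sentences(text3, allow_lowercase_start=True))
```

Because the ending uses + rather than *, a trailing fragment without closing punctuation is not returned at all, which may or may not be what you want.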
Upvotes: 0