Reputation: 79
I am using spaCy for text mining in one of my projects. Is there any way to prevent sentence segmentation at coordinating conjunctions (and, or, yet, etc.) without using custom segmentation?
import spacy

nlp = spacy.load("en_core_web_sm")
document = "I love swimming and i love playing badminton too"
doc = nlp(document)
for sent in doc.sents:
    print(sent)
Output:
I love swimming
and i love playing badminton too
Expected output:
I love swimming and i love playing badminton too
Upvotes: 1
Views: 662
Reputation: 53
Downgrading spaCy didn't solve it in my case and I had trouble creating a new virtual environment, but this worked for me: https://spacy.io/usage/processing-pipelines#component-example1
In your case:
import spacy
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if sentence-final punctuation + titlecase token
        if token.text in [".", "!", "?"] and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_sentencizer, before="parser") # Insert before the parser
document = "I love swimming and i love playing badminton too. I love swimming and i love playing badminton too! I love swimming and i love playing badminton too."
doc = nlp(document)
for sent in doc.sents:
    print(sent, '\n')
Output:
I love swimming and i love playing badminton too.
I love swimming and i love playing badminton too!
I love swimming and i love playing badminton too.
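Note: passing the function object directly to nlp.add_pipe only works in spaCy 2.x. If you are on spaCy 3.x, the component must be registered with @Language.component first and added by its string name; a minimal sketch of the same component under that API:
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Sentence starts only after . / ! / ? followed by a titlecase token
        if token.text in [".", "!", "?"] and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")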
Upvotes: 2
Reputation: 1442
Downgrade to spaCy 2.3.0 and en_core_web_sm 2.3.0. The latest versions of spaCy are not stable.
pip install spacy==2.3.0
python -m spacy download en_core_web_sm
If you already have spaCy installed in your virtual environment, you should delete it and create a new virtual environment, as spaCy comes with a lot of dependencies and it is not easy to pin down what is causing the issue.
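For example, something like this (a sketch using the standard venv module; the environment name fresh-env is just a placeholder):
python -m venv fresh-env
source fresh-env/bin/activate   # on Windows: fresh-env\Scripts\activate
pip install spacy==2.3.0
python -m spacy download en_core_web_sm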
Upvotes: 0