Max Luwang

Reputation: 79

How to avoid sentence segmentation at conjunctions in spaCy

I am using spaCy for text mining in one of my projects. Is there any way to stop it from splitting sentences at coordinating conjunctions (and, or, yet, etc.) without writing custom segmentation?

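import spacy

# Assumption: the small English model; the question does not say which model was loaded
nlp = spacy.load("en_core_web_sm")

document = "I love swimming and i love playing badminton too"
doc = nlp(document)
for sent in doc.sents:
    print(sent)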

Output:

I love swimming 
and i love playing badminton too

Expected output:

I love swimming and i love playing badminton too

Upvotes: 1

Views: 662

Answers (2)

Matthias S

Reputation: 53

Downgrading spaCy didn't solve it in my case, and I had trouble creating a new virtual environment, but adding a custom pipeline component that sets the sentence boundaries before the parser runs worked for me: https://spacy.io/usage/processing-pipelines#component-example1

In your case:

import spacy

def custom_sentencizer(doc):
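    # Stop one token before the end so doc[i+1] is always valid and the
    # final token also gets an explicit value
    for i, token in enumerate(doc[:-1]):
        # Start a new sentence only after sentence-final punctuation
        # followed by a titlecase token
        if token.text in [".", "!", "?"] and doc[i+1].is_title: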
            doc[i+1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i+1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
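# Insert before the parser. This is the spaCy v2 call; in v3 the function
# must be registered with @Language.component and added by its string name.
nlp.add_pipe(custom_sentencizer, before="parser")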

document = "I love swimming and i love playing badminton too. I love swimming and i love playing badminton too! I love swimming and i love playing badminton too."
doc = nlp(document)
for sent in doc.sents:
    print(sent,'\n')

Output:

I love swimming and i love playing badminton too.

I love swimming and i love playing badminton too!

I love swimming and i love playing badminton too.
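
To make the boundary rule easier to check in isolation, here is a minimal plain-Python sketch of the same logic on a list of token strings. It does not use spaCy at all; the function names `sentence_starts` and `split_sentences` are made up for illustration, and `str.istitle()` only approximates spaCy's `is_title`.

```python
def sentence_starts(tokens):
    """Return one boolean per token: True where a new sentence begins."""
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    for i in range(len(tokens) - 1):
        # Same rule as the component above: a boundary needs closing
        # punctuation followed by a titlecase token.
        if tokens[i] in {".", "!", "?"} and tokens[i + 1].istitle():
            starts[i + 1] = True
    return starts


def split_sentences(tokens):
    """Group tokens into sentences using the flags from sentence_starts."""
    sentences, current = [], []
    for token, is_start in zip(tokens, sentence_starts(tokens)):
        if is_start and current:
            sentences.append(current)
            current = []
        current.append(token)
    if current:
        sentences.append(current)
    return [" ".join(sentence) for sentence in sentences]


# Hypothetical pre-tokenized input (whitespace split, punctuation separated)
tokens = "I love swimming and i love playing badminton too . I love tennis !".split()
for sentence in split_sentences(tokens):
    print(sentence)
```

Because "and" is never followed by a boundary flag, the conjunction stays inside its sentence; only the "." followed by the titlecase "I" opens a new one.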

Upvotes: 2

Raqib

Reputation: 1442

Downgrade to spaCy 2.3.0 and en_core_web_sm 2.3.0. In my experience, the latest versions of spaCy are not stable.

pip install spacy==2.3.0
python -m spacy download en_core_web_sm

If you already have spaCy installed in your virtual environment, delete it and create a new virtual environment: spaCy comes with a lot of dependencies, and it is not easy to pin down which one is causing the issue.

Upvotes: 0
