jos97

Reputation: 395

spaCy: how to make a clean segmentation?

I have a string like this:

THIS IS UPPERCASE TEXT :PART 1 - PARAGRAPHLorem ipsum:1.1First phrase «test».1.2Second phrase «test» end of phrase.

I would like to have this output (this segmentation with Spacy):

"THIS IS UPPERCASE TEXT :PART 1 - PARAGRAPH" "Lorem ipsum:1.1First phrase «test»." "1.2Second phrase «test» end of phrase."

I tried this with Spacy:

import spacy
from spacy.language import Language
import re

nlp = spacy.load('fr_core_news_lg')
boundary = re.compile('^[0-9]$')

@Language.component('custom_seg')
def custom_seg(doc):
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        if token.text == '.' and boundary.match(prev) and index != length - 1:
            doc[index+1].sent_start = False
        prev = token.text
    return doc

nlp.add_pipe('custom_seg', before='parser')

test = "THIS IS UPPERCASE TEXT :PART 1 - PARAGRAPHLorem ipsum:1.1First phrase «test».1.2Second phrase «test» end of phrase."
doc = nlp(test)

for sentence in doc.sents:
    print("Length " + str(len(sentence.text)))
    print(sentence.text)
    print('____________')

But the output is:

    Length 4
    THIS
    ____________
    Length 12
    IS UPPERCASE
    ____________
    Length 12
    TEXT :PART 1
    ____________
    Length 1
    -
    ____________
    Length 29
    PARAGRAPHLorem ipsum:1.1First
    ____________
    Length 8
    phrase «
    ____________
    Length 24
    test».1.2Second phrase «
    ____________
    Length 20
    test» end of phrase.
    ____________

I don't know where I went wrong. I don't understand why I'm getting these segmentations or how to improve the result.

Upvotes: 0

Views: 118

Answers (1)

polm23

Reputation: 15593

First problem: the tokenizer has no way of knowing that it should split PARAGRAPHLorem into two tokens. I'm not even sure how you could tell it about that without using a tokenizer that checks all possible tokenizations. So you're going to have a very hard time working around that.
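That said, if the glue points in your text follow a predictable pattern, one workaround is to repair the string before it ever reaches the tokenizer. A minimal sketch with plain `re` (both heuristics here are assumptions about this particular text, not a general solution):

```python
import re

def preprocess(text):
    # Assumed heuristic: an UPPERCASE run glued to a Capitalized word
    # (e.g. "PARAGRAPHLorem") marks a missing space.
    text = re.sub(r'([A-Z]{2,})([A-Z][a-z])', r'\1 \2', text)
    # Assumed heuristic: a numbering marker like "1.1" glued to the
    # following word gets a space so it can become its own token.
    text = re.sub(r'(\d+\.\d+)(?=\S)', r'\1 ', text)
    return text

print(preprocess("PARAGRAPHLorem ipsum:1.1First phrase"))
```

You would then call `nlp(preprocess(test))` instead of `nlp(test)`. This kind of cleanup is fragile, but it is far easier than trying to teach the tokenizer to split arbitrary glued words.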

Second problem: your split conditions are inconsistent. Do you want a number like 1.1 to always mark a new sentence or not? If you know where your sentences should start and can assert those starts, that's something you can implement; trying to merely suppress splits at specific places doesn't work reliably (partly due to a bug).
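Asserting the starts could look something like this. A minimal sketch on a blank French pipeline, assuming a `1.1`-style marker reliably begins a sentence (the component name and the marker regex are made up for illustration):

```python
import re

import spacy
from spacy.language import Language

# Assumed convention: a numbering marker like "1.1" always begins a sentence.
MARKER = re.compile(r'^\d+\.\d+$')

@Language.component('mark_sent_starts')
def mark_sent_starts(doc):
    # Set is_sent_start explicitly on every token after the first:
    # True at a marker, False everywhere else, so nothing else can
    # introduce extra boundaries.
    for token in doc[1:]:
        token.is_sent_start = bool(MARKER.match(token.text))
    return doc

nlp = spacy.blank('fr')  # tokenizer only; no parser to fight with
nlp.add_pipe('mark_sent_starts')

doc = nlp("Lorem ipsum : 1.1 First phrase. 1.2 Second phrase.")
for sent in doc.sents:
    print(sent.text)
```

Note this uses `is_sent_start` (the supported setter) rather than the deprecated `sent_start` attribute in your snippet, and it sets an explicit True/False on every token instead of only blocking splits in a few places.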

Upvotes: 1
