Reputation: 395
I have a string like this:
THIS IS UPPERCASE TEXT :PART 1 - PARAGRAPHLorem ipsum:1.1First phrase «test».1.2Second phrase «test» end of phrase.
I would like to have this output (this segmentation with Spacy):
"THIS IS UPPERCASE TEXT :PART 1 - PARAGRAPH" "Lorem ipsum:1.1First phrase «test»." "1.2Second phrase «test» end of phrase."
I tried this with Spacy:
import spacy
from spacy.language import Language
import re

nlp = spacy.load('fr_core_news_lg')
boundary = re.compile('^[0-9]$')

@Language.component('custom_seg')
def custom_seg(doc):
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        if token.text == '.' and boundary.match(prev) and index != (length - 1):
            doc[index + 1].sent_start = False
        prev = token.text
    return doc

nlp.add_pipe('custom_seg', before='parser')

test = "THIS IS UPPERCASE TEXT :PART 1 - PARAGRAPHLorem ipsum:1.1First phrase «test».1.2Second phrase «test» end of phrase."
doc = nlp(test)
for sentence in doc.sents:
    print("Length " + str(len(sentence.text)))
    print(sentence.text)
    print('____________')
But the output is:
Length 4
THIS
____________
Length 12
IS UPPERCASE
____________
Length 12
TEXT :PART 1
____________
Length 1
-
____________
Length 29
PARAGRAPHLorem ipsum:1.1First
____________
Length 8
phrase «
____________
Length 24
test».1.2Second phrase «
____________
Length 20
test» end of phrase.
____________
I don't know where I'm going wrong. I don't understand why I get this segmentation or how to improve it.
Upvotes: 0
Views: 118
Reputation: 15593
First problem: the tokenizer has no way of knowing that it should split PARAGRAPHLorem into two tokens. I'm not even sure how you could tell it about that without using a tokenizer that checks all possible tokenizations. So you're going to have a very hard time working around that.
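If the missing spaces follow a pattern, one workaround is to repair the raw string before it ever reaches spaCy. This is a sketch, not part of the original answer, and the regex is a guess tuned to this one example (a run of uppercase letters glued onto a capitalized word):

```python
import re

def split_glued_words(text: str) -> str:
    # Insert a space between an all-caps run (2+ letters) and a following
    # capitalized lowercase word, e.g. "PARAGRAPHLorem" -> "PARAGRAPH Lorem".
    return re.sub(r'([A-Z]{2,})([A-Z][a-z])', r'\1 \2', text)

print(split_glued_words("PARAGRAPHLorem ipsum"))  # PARAGRAPH Lorem ipsum
```

Any real corpus will need its own pattern; this one would mangle genuine camel-case identifiers, so treat it as a starting point only.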
Second problem: your split conditions are unclear. Do you want a number like 1.1 to always mark a new sentence or not? If you know where your sentences should start and can assert those starts, that's something you can implement, but trying to merely prevent splits at specific places doesn't work reliably (partly due to a bug).
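Asserting the starts could look like the following sketch. It uses a blank French pipeline (no model download, no parser) so the custom component is the only source of sentence boundaries; the component name, the numbering regex, and the pre-spaced input text are all illustrative assumptions, not the asker's data:

```python
import re
import spacy
from spacy.language import Language

# Assumed pattern for section numbers like "1.1" or "1.2".
start_pattern = re.compile(r'^\d+\.\d+$')

@Language.component('assert_starts')
def assert_starts(doc):
    for i, token in enumerate(doc):
        # The first token always starts a sentence; after that, start a new
        # sentence exactly at tokens matching the numbering pattern.
        doc[i].is_sent_start = (i == 0) or bool(start_pattern.match(token.text))
    return doc

nlp = spacy.blank('fr')
nlp.add_pipe('assert_starts')

doc = nlp("Lorem ipsum: 1.1 First phrase. 1.2 Second phrase.")
print([sent.text for sent in doc.sents])
```

Note this sets `is_sent_start` to an explicit True/False for every token, which is the "assert the starts" approach, rather than setting only a few tokens to False and hoping the parser respects the rest.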
Upvotes: 1