vaulttech

Reputation: 505

How does the spaCy tokenizer split sentences?

I find the tokenization code quite complicated, and I still can't find where in the code the sentences are split.

For example, how does the tokenizer know that

Mr. Smitt stayed at home. He was tired

should not be split at "Mr." but should be split before "He"? And where in the code does the split before "He" happen?

(In fact, I am unsure whether I am even looking in the right place: if I search for sents in tokenizer.pyx, I don't find any occurrence.)

Upvotes: 4

Views: 2814

Answers (1)

simbamford

Reputation: 71

You access the sentences through the doc object, using the generator:

doc.sents

The generator yields a series of spans, one per sentence.

As for how the splits are chosen: the tokenizer does not split sentences at all, which is why you won't find sents in tokenizer.pyx. Instead, the document is parsed for dependency relationships, and the sentence boundaries are those gaps between tokens which are not crossed by any dependency. Understanding the parser is not trivial, and you'll have to read into it if you want the details; it uses a neural network to inform the decisions about how to construct the dependency trees. Because the boundaries come from the parse and not simply from where you find a full stop, an abbreviation such as "Mr." need not trigger a split, and the method is more robust as a result.
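To make the "gaps not crossed by dependencies" idea concrete, here is a minimal sketch in plain Python. It is not spaCy's code and does not call spaCy: the functions sentence_starts and split_sentences are hypothetical helpers, and the head indices for the example sentence are written by hand as an assumption about what a parse might look like (each token points at its head, and a root points at itself).

```python
def sentence_starts(heads):
    """Return the token indices at which a new sentence starts.

    heads[i] is the index of token i's syntactic head; a root token
    points to itself. The gap between token k and token k + 1 is a
    boundary when no head-dependent arc spans it.
    """
    n = len(heads)
    if n == 0:
        return []
    crossed = [False] * (n - 1)  # crossed[k]: gap between token k and k + 1
    for i, h in enumerate(heads):
        # An arc between positions i and h crosses every gap lying between them.
        for k in range(min(i, h), max(i, h)):
            crossed[k] = True
    return [0] + [k + 1 for k in range(n - 1) if not crossed[k]]


def split_sentences(tokens, heads):
    """Group tokens into sentences using the uncrossed gaps."""
    starts = sentence_starts(heads)
    ends = starts[1:] + [len(tokens)]
    return [tokens[s:e] for s, e in zip(starts, ends)]


# Hand-written (assumed) parse of the example from the question.
tokens = ["Mr.", "Smitt", "stayed", "at", "home", ".", "He", "was", "tired"]
heads = [1, 2, 2, 2, 3, 2, 7, 7, 7]  # "stayed" and "was" are the two roots

for sentence in split_sentences(tokens, heads):
    print(" ".join(sentence))
```

With these assumed heads, "Mr." attaches to "Smitt", so the gap after it is crossed and no split happens there. The only gap that no arc crosses is the one between "." and "He", so the tokens come out as two groups.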

Upvotes: 1
