revy

Reputation: 4727

Python nltk incorrect sentence tokenization with custom abbreviations

I am using the nltk tokenize library to split up English sentences. Many sentences contain abbreviations such as e.g. or eg., so I updated the tokenizer with these custom abbreviations. I found strange tokenization behaviour with one sentence, though:

import nltk

nltk.download("punkt")
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

extra_abbreviations = ['e.g', 'eg']
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

line = 'Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. Karma, Tape)'

for s in sentence_tokenizer.tokenize(line):
    print(s)

# OUTPUT
# Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g.
# Karma, Tape)

So, as you can see, the tokenizer does not split after the first abbreviation (correct) but it does after the second (incorrect).

The weird thing is that if I change the word Karma to anything else, it works correctly.

import nltk

nltk.download("punkt")
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

extra_abbreviations = ['e.g', 'eg']
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

line = 'Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. SomethingElse, Tape)'

for s in sentence_tokenizer.tokenize(line):
    print(s)

# OUTPUT
# Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. SomethingElse, Tape)

Any clue why is this happening?

Upvotes: 3

Views: 1045

Answers (1)

gz.

Reputation: 6711

You can see why punkt is making the break choices it is by using the debug_decisions method.

>>> for d in sentence_tokenizer.debug_decisions(line):
...     print(nltk.tokenize.punkt.format_debug_decision(d))
... 
Text: '(e.g. React,' (at offset 47)
Sentence break? None (default decision)
Collocation? False
'e.g.':
    known abbreviation: True
    is initial: False
'react':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: {'MID-UC', 'MID-LC'}

Text: '(e.g. Karma,' (at offset 80)
Sentence break? True (abbreviation + orthographic heuristic)
Collocation? False
'e.g.':
    known abbreviation: True
    is initial: False
'karma':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'MID-LC'}

This tells us that in the corpus used for training, 'react' appeared mid-sentence both capitalised and in lowercase ({'MID-UC', 'MID-LC'}), so the tokenizer does not break before 'React' in your line. However, 'karma' occurred only in lowercase mid-sentence ({'MID-LC'}), so a capitalised 'Karma' after a full stop looks like a likely sentence start.

Note, this is in line with the documentation for the library:

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.

PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.

A quick hack for this particular case is to tweak the private _params further to record that 'Karma' may also appear mid-sentence:

>>> sentence_tokenizer._params.ortho_context['karma'] |= nltk.tokenize.punkt._ORTHO_MID_UC
>>> sentence_tokenizer.tokenize(line)
['Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. Karma, Tape)']

Instead, you should probably add additional training data from CVs that include all these library names:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
# tweak trainer params here if helpful
trainer.train(my_corpus_of_concatted_tech_cvs)
sentence_tokenizer = PunktSentenceTokenizer(trainer.get_params())
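If you go that route, note that the learned parameters replace the pretrained ones, so the custom abbreviations from the question need adding again on top of what the trainer learned. A runnable sketch with a toy stand-in corpus (the corpus string here is hypothetical, standing in for the real CV text):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical stand-in for a real corpus of concatenated tech CVs
corpus = "We use Karma for testing. Karma runs the test suite in real browsers."

trainer = PunktTrainer()
trainer.train(corpus)

params = trainer.get_params()
params.abbrev_types.update(['e.g', 'eg'])  # re-add the custom abbreviations

sentence_tokenizer = PunktSentenceTokenizer(params)
print(len(sentence_tokenizer.tokenize("First sentence. Second sentence.")))  # 2
```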

Upvotes: 3
