Reputation: 4727
I am using the NLTK tokenize library to split up English sentences. Many sentences contain abbreviations such as e.g. or eg., so I updated the tokenizer with these custom abbreviations. Still, I found a strange tokenization behaviour with one sentence:
import nltk
nltk.download("punkt")
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
extra_abbreviations = ['e.g', 'eg']
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
line = 'Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. Karma, Tape)'
for s in sentence_tokenizer.tokenize(line):
    print(s)
# OUTPUT
# Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g.
# Karma, Tape)
So, as you can see, the tokenizer does not split on the first abbreviation (correct) but does split on the second (incorrect). The weird thing is that if I change the word Karma to anything else, it works correctly.
import nltk
nltk.download("punkt")
sentence_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
extra_abbreviations = ['e.g', 'eg']
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
line = 'Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. SomethingElse, Tape)'
for s in sentence_tokenizer.tokenize(line):
    print(s)
# OUTPUT
# Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. SomethingElse, Tape)
Any clue why this is happening?
Upvotes: 3
Views: 1045
Reputation: 6711
You can see why punkt is making the break choices it is by using the debug_decisions method.
>>> for d in sentence_tokenizer.debug_decisions(line):
...     print(nltk.tokenize.punkt.format_debug_decision(d))
...
Text: '(e.g. React,' (at offset 47)
Sentence break? None (default decision)
Collocation? False
'e.g.':
    known abbreviation: True
    is initial: False
'react':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: {'MID-UC', 'MID-LC'}
Text: '(e.g. Karma,' (at offset 80)
Sentence break? True (abbreviation + orthographic heuristic)
Collocation? False
'e.g.':
    known abbreviation: True
    is initial: False
'karma':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'MID-LC'}
This tells us that in the corpus used for training, both 'react' and 'React' appear in the middle of sentences, so punkt does not break before 'React' in your line. However, only the lowercase form 'karma' occurs, so punkt considers this a likely sentence start point.
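You can check those learned contexts yourself by reading the same private _params used below: the ortho_context entry for each (lowercased) type is a bit mask combining the punkt module's internal _ORTHO_* flags. A quick sketch, not a public API:
from nltk.tokenize import punkt

# Peek at the learned orthographic context for each type (internal bit mask).
# _ORTHO_MID_UC means "seen capitalised mid-sentence",
# _ORTHO_MID_LC means "seen lowercase mid-sentence".
for word in ('react', 'karma'):
    ctx = sentence_tokenizer._params.ortho_context[word]
    print(word,
          'MID-UC' if ctx & punkt._ORTHO_MID_UC else '-',
          'MID-LC' if ctx & punkt._ORTHO_MID_LC else '-')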
Note, this is in line with the documentation for the library:
However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.
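In its simplest form that means building a tokenizer straight from a chunk of in-domain text, which trains it in one step (a rough sketch; tech_cvs.txt is just a placeholder for whatever domain text you have):
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Train a fresh tokenizer directly from in-domain text in one step
# (tech_cvs.txt is a hypothetical file of concatenated CVs).
with open("tech_cvs.txt", encoding="utf-8") as f:
    sentence_tokenizer = PunktSentenceTokenizer(f.read())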
So, while a quick hack for this particular case is tweaking the private _params further to say that 'Karma' may also appear mid-sentence:
>>> sentence_tokenizer._params.ortho_context['karma'] |= nltk.tokenize.punkt._ORTHO_MID_UC
>>> sentence_tokenizer.tokenize(line)
['Required experience with client frameworks (e.g. React, Vue.js) and testing (e.g. Karma, Tape)']
Instead, maybe you should add additional training data from CVs that include all these library names:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
# tweak trainer params here if helpful
trainer.train(my_corpus_of_concatted_tech_cvs)
sentence_tokenizer = PunktSentenceTokenizer(trainer.get_params())
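Depending on how often abbreviations like 'e.g.' actually occur in that corpus, you may still want to re-apply the manual tweak from the question on top of the retrained tokenizer (same private-attribute caveat as above):
# Re-add the custom abbreviations if the corpus alone does not teach them.
sentence_tokenizer._params.abbrev_types.update(['e.g', 'eg'])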
Upvotes: 3