Reputation: 117
I am parsing sentences with the NLTK Punkt tokenizer, but some abbreviations cause sentences to be split in the wrong places.
For example:
"Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message."
The parser splits it up like this:
'Hello, good day.'
'Said the dog, all canines understood the dog(Wolfs, etc.)'
'the message.'
But I need it to be like this:
'Hello, good day.'
'Said the dog, all canines understood the dog(Wolfs, etc.) the message.'
My code:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def parser(text):
    punkt_param = PunktParameters()
    abbreviation = ['u.s.a', 'e.g', 'u.s']
    punkt_param.abbrev_types = set(abbreviation)
    # Training a new model with the text.
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.train(text)
    # It automatically learns the abbreviations.
    tokenizer._params.abbrev_types
    # Use the customized tokenizer.
    sentences = tokenizer.tokenize(text)
    return sentences
I cannot simply add "etc" to the list of abbreviations, since it sometimes occurs at the end of sentences.
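To illustrate that trade-off, here is a minimal sketch that hard-codes "etc" as the only abbreviation and uses no training at all. The two sample strings and the variable names are made up for illustration. With otherwise empty parameters, I would expect the tokenizer to keep "etc." inside the first sentence, but also to miss the sentence break after "etc." in the second string.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hard-code "etc" as an abbreviation (lowercase, without the trailing period).
params = PunktParameters()
params.abbrev_types = {"etc"}
tokenizer = PunktSentenceTokenizer(params)

# Hypothetical sample texts: "etc." mid-sentence vs. at the end of a sentence.
mid = "Said the dog (Wolfs, etc.) the message. Then it left."
end = "It brought sticks, bones, etc. The walk was over."

mid_sentences = tokenizer.tokenize(mid)
end_sentences = tokenizer.tokenize(end)

print(mid_sentences)
print(end_sentences)
```

So a fixed abbreviation list alone helps with the mid-sentence case but gives the tokenizer nothing to decide the end-of-sentence case with.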
Upvotes: 1
Views: 2622
Reputation: 24154
The Punkt tokenizer can be trained to recognize "etc." in the middle of a sentence, or at the end of a sentence.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
corpus = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc".
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
Sometimes an abbreviation can occur at the end of a sentence, such as etc.
And then it needs to split at the end.
"""
trainer.train(corpus, finalize=False, verbose=True)
abbreviations = "u.s.a., e.g., u.s."
trainer.train(abbreviations, finalize=False, verbose=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())
text = "Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message. Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
Output (verbose training log, followed by the sentences):
Abbreviation: [1.2711] e.g
Rare Abbrev: etc.
Abbreviation: [1.1134] u.s
Abbreviation: [0.6144] u.s.a
Abbreviation: [2.2269] e.g
Hello, good day.
Said the dog, all canines understood the dog(Wolfs, etc.) the message.
Abbreviations can be tricky at the end, or final position etc.
The question becomes whether or not the tokenizer can spot the difference.
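For comparison, here is the same code with the two training sentences that show "etc." at the end of a sentence removed from the corpus. The tokenizer then no longer splits after the final "etc.":
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer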
trainer = PunktTrainer()
corpus = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc".
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
"""
trainer.train(corpus, finalize=False, verbose=True)
abbreviations = "u.s.a., e.g., u.s."
trainer.train(abbreviations, finalize=False, verbose=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())
text = "Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message. Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
Output (verbose training log, followed by the sentences):
Abbreviation: [1.2410] e.g
Rare Abbrev: etc.
Abbreviation: [1.0382] u.s
Abbreviation: [0.5729] u.s.a
Abbreviation: [2.0764] e.g
Hello, good day.
Said the dog, all canines understood the dog(Wolfs, etc.) the message.
Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference.
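If you want to investigate why the two runs differ, one option is to inspect the trained parameters. The sketch below trains on both corpora and prints the learned abbreviations plus the orthographic statistics Punkt recorded for the word "the", which follows "etc." in the test text. The helper name `train_params` is made up for this example, and the idea that `ortho_context` for the following word drives the decision is my assumption, not something I have confirmed in the Punkt source. What can be read off the corpora directly is that the first one contains a lowercase "the" mid-sentence and the second does not contain the word at all.

```python
from nltk.tokenize.punkt import PunktTrainer

def train_params(corpus):
    """Hypothetical helper: train Punkt on a corpus and return its parameters."""
    trainer = PunktTrainer()
    trainer.train(corpus, finalize=False)
    # get_params() finalizes the training if that has not happened yet.
    return trainer.get_params()

corpus_with_end = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc".
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
Sometimes an abbreviation can occur at the end of a sentence, such as etc.
And then it needs to split at the end.
"""

corpus_without_end = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc".
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
"""

params_with = train_params(corpus_with_end)
params_without = train_params(corpus_without_end)

for label, params in [("with", params_with), ("without", params_without)]:
    # abbrev_types holds learned abbreviations, lowercase and without the final period.
    print(label, "abbreviations:", sorted(params.abbrev_types))
    # ortho_context is keyed by lowercased word type; 0 means nothing was recorded.
    print(label, "ortho flags for 'the':", params.ortho_context.get("the", 0))
```

If the flags for "the" are zero in the second run, that would fit the observed behavior: the tokenizer has never seen the word and so has no evidence that a capitalized "The" starts a sentence.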
Upvotes: 3