How to get NLTK Punkt sentence tokenizer to recognize abbreviations that occur in the middle or end of a sentence?

I am parsing sentences with the NLTK Punkt tokenizer, but some specific abbreviations cause sentences to split in the wrong places.

For example:

"Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message."

The parser splits it up like this:

'Hello, good day.'
'Said the dog, all canines understood the dog(Wolfs, etc.)'
'the message.'

But I need it to be split like this:

'Hello, good day.'
'Said the dog, all canines understood the dog(Wolfs, etc.) the message.'

My code:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def parser(text):
    punkt_param = PunktParameters()
    abbreviations = ['u.s.a', 'e.g', 'u.s']
    punkt_param.abbrev_types = set(abbreviations)

    # Train a new model on the text; training adds any abbreviations
    # it discovers to tokenizer._params.abbrev_types.
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.train(text)

    # Use the customized tokenizer.
    return tokenizer.tokenize(text)

I cannot simply add "etc" to the list of abbreviations, since it sometimes occurs at the end of sentences.
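(To illustrate the problem: here is a minimal sketch of what happens when "etc" is simply added to abbrev_types without any training. With no learned orthographic context or sentence starters, Punkt never breaks after "etc.", even when it really does end a sentence. The example text is hypothetical.)

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Naively listing "etc" as an abbreviation suppresses ALL breaks after it.
punkt_param.abbrev_types = set(['etc', 'u.s.a', 'e.g', 'u.s'])
tokenizer = PunktSentenceTokenizer(punkt_param)

text = "Cellars, vaults, etc. The choice depends on budget."
print(tokenizer.tokenize(text))
# Both clauses come back as a single "sentence": the break after
# "etc." that should separate them is never made.
```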

Upvotes: 1

Views: 2622

Answers (1)

Christopher Peisert

Reputation: 24154

The Punkt tokenizer can be trained to recognize "etc." both in the middle of a sentence and at the end of a sentence.

Training example for recognizing "etc."

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
corpus = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc". 
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
Sometimes an abbreviation can occur at the end of a sentence, such as etc.
And then it needs to split at the end.
"""
trainer.train(corpus, finalize=False, verbose=True)

abbreviations = "u.s.a., e.g., u.s."
trainer.train(abbreviations, finalize=False, verbose=True)

tokenizer = PunktSentenceTokenizer(trainer.get_params())

text = "Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message. Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference."

sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)

Output

  Abbreviation: [1.2711] e.g
  Rare Abbrev: etc.
  Abbreviation: [1.1134] u.s
  Abbreviation: [0.6144] u.s.a
  Abbreviation: [2.2269] e.g
Hello, good day.
Said the dog, all canines understood the dog(Wolfs, etc.) the message.
Abbreviations can be tricky at the end, or final position etc.
The question becomes whether or not the tokenizer can spot the difference.

Example of insufficient training data to recognize "etc." at the end of a sentence

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
corpus = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc". 
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
"""
trainer.train(corpus, finalize=False, verbose=True)

abbreviations = "u.s.a., e.g., u.s."
trainer.train(abbreviations, finalize=False, verbose=True)

tokenizer = PunktSentenceTokenizer(trainer.get_params())

text = "Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message. Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference."

sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)

Output

  Abbreviation: [1.2410] e.g
  Rare Abbrev: etc.
  Abbreviation: [1.0382] u.s
  Abbreviation: [0.5729] u.s.a
  Abbreviation: [2.0764] e.g
Hello, good day.
Said the dog, all canines understood the dog(Wolfs, etc.) the message.
Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference.

Upvotes: 3
