Muhammad Zeeshan

Reputation: 89

Error importing 'PunktWordTokenizer'

I am trying to tokenize sentences using nltk.tokenize, but the following error occurs when I run the code:

cannot import name 'PunktWordTokenizer'.

I tried to find a solution in different sources but couldn't come up with anything. I also searched the GitHub issues, without success.

from nltk.tokenize import PunktWordTokenizer
tokenizer = PunktWordTokenizer()
tokenizer.tokenize("Can't is a contraction.")

I expected the tokenised output, but the error above occurred instead.

Upvotes: 4

Views: 851

Answers (1)

BoarGules

Reputation: 16942

It isn't clear which tokenizer you want. There is no longer one called PunktWordTokenizer; it was internal and was never intended to be public, which is why you can't import that name. The two classes with the closest names are WordPunctTokenizer and PunktSentenceTokenizer.
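
If you are not sure which tokenizer classes your installed NLTK actually exports, you can list them. A quick verification sketch (the exact contents vary by NLTK version):

>>> import nltk.tokenize
>>> sorted(name for name in dir(nltk.tokenize) if name.endswith('Tokenizer'))

PunktWordTokenizer will not appear in that list, but WordPunctTokenizer and PunktSentenceTokenizer will.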

Import the right name and it will work:

>>> import nltk
>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']

Since you say you are looking for tokenized sentences, maybe the other one is what you want:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't is a contraction."]
>>> tokenizer.tokenize("Can't is a contraction. So is hadn't.")
["Can't is a contraction.", "So is hadn't."]

Upvotes: 2
