jss367

Reputation: 5381

NLTK tokenize text with dialog into sentences

I can tokenize non-dialog text into sentences, but when I add quotation marks to the text, the NLTK tokenizer doesn't split it correctly. For example, this works as expected:

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)

This results in a list of three different sentences:

['Is this one sentence?', 'This is separate.', 'This is a third he said.']

However, if I turn it into dialog, the same process doesn't work.

text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)

This returns it as a single sentence:

['“Is this one sentence?” “This is separate.” “This is a third” he said.']

How can I make the NLTK tokenizer work in this case?

Upvotes: 0

Views: 1216

Answers (1)

alexis

Reputation: 50200

It seems the tokenizer doesn't know what to do with the directional (curly) quotes. Replace them with regular ASCII double quotes and the example works fine.

>>> import re
>>> import nltk
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']
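
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: `QUOTE_MAP`, `normalize_quotes`, and `split_dialog_sentences` are names made up for this example, and mapping the single curly quotes as well is an assumption that goes beyond what was tried above. The default tokenizer is `nltk.sent_tokenize`, imported lazily, so it still needs the punkt data installed.

```python
# Map typographic quotation marks to their ASCII equivalents.
# The single-quote entries are an assumption; the example above
# only covers the double curly quotes.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also the curly apostrophe)
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


def split_dialog_sentences(text, tokenize=None):
    """Normalize quotes, then split the text into sentences.

    `tokenize` is any callable that maps a string to a list of sentences.
    When it is omitted, nltk.sent_tokenize is imported and used.
    """
    if tokenize is None:
        import nltk  # imported lazily so normalize_quotes works without NLTK
        tokenize = nltk.sent_tokenize
    return tokenize(normalize_quotes(text))
```

Calling `split_dialog_sentences(text2)` should then behave like the `re.sub` version above. Note that this changes the characters in the returned sentences, so if you need the original curly quotes back you would have to restore them yourself.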

Upvotes: 3
