Reputation: 9816

Simple tokenization issue in NLTK

I want to tokenize the following text:

In Düsseldorf I took my hat off. But I can't put it back on.


The output I want is:

'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I',
"can't", 'put', 'it', 'back', 'on', '.'

But to my surprise, none of the NLTK tokenizers produce this output. How can I accomplish this? Is it possible to combine these tokenizers somehow to achieve the above?

Upvotes: 1

Answers (2)

alvas

Reputation: 122240

You should split the text into sentences before tokenizing the words:

>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]

If you want to get them back into a single list:

>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
>>> list(chain(*text))
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
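
As a side note, `chain.from_iterable` does the same flattening without unpacking the outer list with `*`. A minimal sketch, with the per-sentence token lists hard-coded from the output above so that it runs without NLTK:

```python
from itertools import chain

# Per-sentence token lists, copied from the output shown above
# (hard-coded so the snippet runs without NLTK installed).
sentences = [
    ["In", "Düsseldorf", "I", "took", "my", "hat", "off", "."],
    ["But", "I", "ca", "n't", "put", "it", "back", "on", "."],
]

# from_iterable walks the outer list lazily instead of unpacking it with *.
flat = list(chain.from_iterable(sentences))
print(flat)
```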

If you must merge ["ca", "n't"] back into ["can't"]:

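>>> from itertools import izip_longest, chain  # Python 2; on Python 3 the function is named zip_longest
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."  # reset: text was rebound to a list above
>>> tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))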
>>> contractions = ["n't", "'ll", "'re", "'s"]

# Iterate through two words at a time and then join the contractions back.
>>> [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", "n't", 'put', 'it', 'back', 'on', '.']
# Remove the standalone contraction tokens, since they have been joined to the preceding word.
>>> [w for w in [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])] if w not in contractions]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
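
On Python 3, `izip_longest` is named `zip_longest`. Below is a minimal sketch of the same join-then-filter idea for Python 3. The token list is hard-coded from the output above so that it runs without NLTK, and using a set for `contractions` is my own choice rather than part of the original snippet:

```python
from itertools import zip_longest  # Python 3 name for izip_longest

# Tokens as produced by sent_tokenize + word_tokenize in the session above.
tok_text = ["In", "Düsseldorf", "I", "took", "my", "hat", "off", ".",
            "But", "I", "ca", "n't", "put", "it", "back", "on", "."]
contractions = {"n't", "'ll", "'re", "'s"}

# Pair each token with its successor; the last token is paired with None.
pairs = zip_longest(tok_text, tok_text[1:])
# Glue a contraction onto the token before it ...
joined = [w1 + w2 if w2 in contractions else w1 for w1, w2 in pairs]
# ... then drop the standalone contraction tokens that are left over.
merged = [w for w in joined if w not in contractions]
print(merged)
```

Note that the filter drops every token that exactly equals an entry in `contractions`, so this assumes such tokens never occur on their own in your text.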

Upvotes: 1

Roman Kutlak

Reputation: 2784

You can take one of the tokenizers as a starting point and then fix the contractions (assuming that is the problem):

from nltk.tokenize.treebank import TreebankWordTokenizer

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tokens = TreebankWordTokenizer().tokenize(text)

contractions = ["n't", "'ll", "'m"]
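# Indices of the contraction tokens that must be merged into the preceding token.
fix = [i for i, tok in enumerate(tokens) if tok in contractions]

# Each merge deletes one token, so later indices shift left by one per merge.
fix_offset = 0
for fix_id in fix:
    idx = fix_id - 1 - fix_offset  # position of the token before the contraction
    tokens[idx] = tokens[idx] + tokens[idx+1]
    del tokens[idx+1]
    fix_offset += 1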

print(tokens)

Output:

['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
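
The same merge can also be done in a single pass that builds a new list, which avoids the index offset bookkeeping. This is only a sketch: the function name `merge_contractions` is made up for illustration, and the input tokens are hard-coded (assuming the tokenizer split "can't" into "ca" and "n't") so that it runs without NLTK:

```python
def merge_contractions(tokens, contractions=("n't", "'ll", "'m")):
    """Return a new list with each contraction glued onto the token before it."""
    merged = []
    for tok in tokens:
        # Only merge when there is a preceding token to attach to.
        if tok in contractions and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


# Hard-coded tokenizer output (assumed), so the example does not need NLTK.
tokens = ["In", "Düsseldorf", "I", "took", "my", "hat", "off", ".",
          "But", "I", "ca", "n't", "put", "it", "back", "on", "."]
result = merge_contractions(tokens)
print(result)
```

Unlike the in-place version, this leaves a contraction at the very start of the list untouched instead of merging it into the last token through a negative index.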

Upvotes: 2
