Reputation: 13123
I would like to tokenise Spanish sentences into words. Is the following the correct approach or is there a better way of doing this?
import nltk
from nltk.tokenize import word_tokenize
def spanish_word_tokenize(s):
    for w in word_tokenize(s):
        if w[0] in ("¿", "¡"):
            yield w[0]
            yield w[1:]
        else:
            yield w
sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
sentences = spanish_sentence_tokenizer.tokenize(sentences)
for s in sentences:
    print([w for w in spanish_word_tokenize(s)])
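For comparison, here is a minimal regex-only sketch using just the standard library (the helper name is made up) that also splits off the inverted ¿/¡ marks, though it lacks the linguistic rules of a real tokenizer:

```python
import re

def simple_spanish_tokenize(text):
    # Match runs of word characters, or any single character that is
    # neither a word character nor whitespace (so ¿, ¡, ?, ! become
    # their own tokens). A toy sketch, not a real tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_spanish_tokenize("¿Quién eres tú? ¡Hola!"))
```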
Upvotes: 4
Views: 6803
Reputation: 8642
There is a simpler solution using spaCy. However, it only works after you have downloaded the spaCy Spanish data first: python -m spacy download es
import spacy
nlp = spacy.load('es')
sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
doc = nlp(sentences)
tokens = [token for token in doc]
print(tokens)
Gives a correct answer:
[¿, Quién, eres, tú, ?, ¡, Hola, !, ¿, Dónde, estoy, ?]
I don't recommend NLTK's ToktokTokenizer, since according to the documentation "the input must be one sentence per line; thus only final period is tokenized", so you would have to worry about segmenting into sentences first.
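If you do have to segment into sentences first, a naive standard-library sketch (hypothetical helper name, only suited to toy input like the example above) could be:

```python
import re

def naive_sentence_split(text):
    # Split after sentence-final punctuation followed by whitespace.
    # A toy sketch; a real sentence segmenter (punkt, spaCy) handles
    # abbreviations, quotes, etc., which this does not.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("¿Quién eres tú? ¡Hola! ¿Dónde estoy?"))
```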
Upvotes: 3
Reputation: 122042
Cf. NLTK GitHub issue #1214; there are quite a few alternative tokenizers in NLTK =)
E.g. using the NLTK port of @jonsafari's toktok tokenizer:
>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data] /Users/liling.tan/nltk_data...
[nltk_data] Package perluniprops is already up-to-date!
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data] /Users/liling.tan/nltk_data...
[nltk_data] Package nonbreaking_prefixes is already up-to-date!
True
>>> from nltk.tokenize.toktok import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> sent = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> toktok.tokenize(sent)
[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?', u'\xa1Hola', u'!', u'\xbf', u'D\xf3nde', u'estoy', u'?']
>>> print " ".join(toktok.tokenize(sent))
¿ Quién eres tú ? ¡Hola ! ¿ Dónde estoy ?
>>> from nltk import sent_tokenize
>>> sentences = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> [toktok.tokenize(sent) for sent in sent_tokenize(sentences, language='spanish')]
[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
>>> print '\n'.join([' '.join(toktok.tokenize(sent)) for sent in sent_tokenize(sentences, language='spanish')])
¿ Quién eres tú ?
¡Hola !
¿ Dónde estoy ?
If you hack the code a little and add u'\xa1' to the punctuation handling in https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51 , you should be able to get:
[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1', u'Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
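If you'd rather not patch NLTK's source, one workaround sketch is to pre-split u'\xa1' from the following word yourself before calling toktok.tokenize (standard library only; split_inverted_exclamation is a made-up name):

```python
def split_inverted_exclamation(text):
    # Insert a space after each inverted exclamation mark so that a
    # downstream tokenizer treats it as a separate token. A workaround
    # sketch instead of editing NLTK's toktok.py.
    return text.replace(u"\u00a1", u"\u00a1 ")

print(split_inverted_exclamation(u"\u00a1Hola!"))
```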
Upvotes: 2