Reputation: 5474
Say I have a corpus of annotated text where a sentence looks something like:
txt = 'red foxes <emotion>scare</emotion> me.'
Is it possible to tokenize this using word_tokenize in such a way that we get:
['red', 'foxes', '<emotion>scare</emotion>', 'me', '.']
We could also use an alternative annotation scheme, say:
txt = 'red foxes scare_EMOTION me'
Is it possible to do this with NLTK? Currently I'm parsing out the annotations and tracking them out of band, which is very cumbersome.
Upvotes: 1
Views: 181
Reputation: 10794
To achieve the desired result you don't need nltk.
Just run txt.split()
If you insist on using nltk, check out the different tokenizers. PunktWordTokenizer and WhitespaceTokenizer fit.
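As a rough illustration of the whitespace approach, and of one possible way to also split off the trailing period, here is a minimal sketch using only Python's standard re module. The names TOKEN_RE and tokenize are hypothetical helpers made up for this example, not part of NLTK, and the pattern assumes annotations are simple, non-nested <tag>...</tag> pairs.

```python
import re

# Hypothetical pattern (an assumption for this sketch, not an NLTK API):
# try an annotated span first, then a run of word characters,
# then any single non-space punctuation character.
TOKEN_RE = re.compile(r"<(\w+)>.*?</\1>|\w+|[^\w\s]")

def tokenize(text):
    """Return tokens, keeping each <tag>...</tag> span as a single token."""
    # finditer is used instead of findall because the pattern has a
    # capturing group, and findall would return the group, not the match.
    return [m.group(0) for m in TOKEN_RE.finditer(text)]

txt = 'red foxes <emotion>scare</emotion> me.'

# Plain whitespace splitting keeps the annotation intact,
# but leaves the final period attached to 'me'.
whitespace_tokens = txt.split()
print(whitespace_tokens)
# ['red', 'foxes', '<emotion>scare</emotion>', 'me.']

# The regex sketch also separates the punctuation.
regex_tokens = tokenize(txt)
print(regex_tokens)
# ['red', 'foxes', '<emotion>scare</emotion>', 'me', '.']
```

Note that txt.split() alone returns 'me.' rather than 'me' and '.', so if the period has to be its own token, something like the regex variant is needed. The same idea should carry over to the scare_EMOTION scheme, since \w+ already treats the underscore as a word character.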
Upvotes: 2