user1172468

Reputation: 5474

Using NLTK methods such as tokenize on annotated text

Say I have a corpus of annotated text where a sentence looks something like:

txt = 'red foxes <emotion>scare</emotion> me.'

Is it possible to tokenize this using word_tokenize in such a way that we get:

['red', 'foxes', '<emotion>scare</emotion>', 'me', '.']

We could also use an alternative annotation scheme, say:

txt = 'red foxes scare_EMOTION me'

Is it possible to do this with NLTK? Currently I'm parsing out the annotations and tracking them out of band, which is very cumbersome.

Upvotes: 1

Views: 181

Answers (1)

rokpoto.com

Reputation: 10794

To achieve the desired result you don't need NLTK.

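Just run txt.split(), which splits on whitespace and so keeps the annotated word as a single token (note that the trailing period stays attached to 'me').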
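A minimal sketch of what that looks like on the question's sentence. The helper split_trailing_punct is hypothetical (not part of the answer or of NLTK); it is only there to peel the final period off, since a plain whitespace split leaves it glued to the last word.

```python
import re

txt = 'red foxes <emotion>scare</emotion> me.'

# Plain whitespace split: the annotated word survives as one token,
# but the final period stays attached to 'me'.
tokens = txt.split()


def split_trailing_punct(token):
    """Hypothetical helper: separate trailing sentence punctuation from a token."""
    match = re.fullmatch(r'(.*?)([.,!?;:]+)', token)
    if match and match.group(1):
        return [match.group(1), match.group(2)]
    return [token]


# Flatten the per-token pieces back into a single token list.
tokens_with_punct = [piece for tok in tokens for piece in split_trailing_punct(tok)]

print(tokens)
print(tokens_with_punct)
```

The same approach works for the scare_EMOTION scheme, since that annotation contains no whitespace either.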

If you insist on using NLTK, check out the different tokenizers.

WhitespaceTokenizer fits, as does PunktWordTokenizer in the older NLTK releases that still include it.
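For illustration, here is a rough sketch of the NLTK route, assuming NLTK is installed. WhitespaceTokenizer behaves like txt.split(). The RegexpTokenizer pattern is my own assumption rather than something from the answer: it treats a whole <tag>...</tag> span as one token and splits punctuation off separately.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

txt = 'red foxes <emotion>scare</emotion> me.'

# Splits on whitespace only, so 'me.' keeps its period.
whitespace_tokens = WhitespaceTokenizer().tokenize(txt)

# Assumed pattern, tried in order: an annotated span, a run of word
# characters, or a single punctuation character. No capturing groups
# are used, so each match is returned whole.
annotation_aware = RegexpTokenizer(r'<\w+>[^<]*</\w+>|\w+|[^\w\s]')
regexp_tokens = annotation_aware.tokenize(txt)

print(whitespace_tokens)
print(regexp_tokens)
```

Because the underscore counts as a word character, the same pattern also keeps scare_EMOTION together under the alternative annotation scheme.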

Upvotes: 2
