unixsnob

Reputation: 1715

Python NLTK tokenizing text using already found bigrams

Background: I have a lot of text that contains technical expressions, which are not always standard.

I know how to find the bigrams and filter them.
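For reference, a minimal sketch of that step using NLTK's collocation finder (the `tokens` variable is a placeholder for my tokenized corpus, and the thresholds are illustrative):

```python
# Minimal sketch of the bigram finding/filtering I already do;
# `tokens` stands in for my tokenized corpus.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                            # drop rare bigrams
best_bigrams = finder.nbest(bigram_measures.pmi, 50)   # top 50 by PMI
```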

Now I want to use them when tokenizing the sentences, so that words which should stay together (according to the calculated bigrams) are kept together.

I would like to know if there is a correct way of doing this within NLTK. If not, I can think of various inefficient ways of rejoining all the broken words by checking dictionaries.

Upvotes: 0

Views: 388

Answers (1)

Everst

Reputation: 376

The way topic modelers usually pre-process text with n-grams is to join them with an underscore (say, topic_modeling or white_house). You can do that when identifying the bigrams themselves. And don't forget to make sure that your tokenizer does not split on underscores (Mallet does if you don't set token-regex explicitly).
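For instance, if you already have your bigrams as word-pair tuples, NLTK's MWETokenizer can do the underscore joining at tokenization time (a rough sketch; `found_bigrams` here is just an illustrative list standing in for your collocation results):

```python
# Rough sketch: re-join already-found bigrams with an underscore
# while tokenizing; `found_bigrams` is assumed to come from your
# collocation-finding step.
from nltk.tokenize import MWETokenizer, word_tokenize

found_bigrams = [('white', 'house'), ('topic', 'modeling')]
tokenizer = MWETokenizer(found_bigrams, separator='_')
tokenizer.tokenize(word_tokenize("the white house discussed topic modeling"))
# -> ['the', 'white_house', 'discussed', 'topic_modeling']
```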

P.S. NLTK's native bigram collocation finder is super slow; if you want something more efficient, look around if you haven't yet, or create your own based on, say, Dunning (1993).
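If it helps, here is a rough self-contained sketch of such a scorer, using plain Counter counts and Dunning's (1993) log-likelihood ratio (the token list and frequency threshold are illustrative):

```python
# Rough sketch of a Counter-based bigram scorer using Dunning's (1993)
# log-likelihood ratio; `tokens` and `min_count` are illustrative.
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    def h(*ks):                       # sum of k*log(k/total), skipping zeros
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def score_bigrams(tokens, min_count=3):
    bigrams = Counter(zip(tokens, tokens[1:]))
    left, right = Counter(), Counter()    # marginal counts of w1 and w2
    for (w1, w2), c in bigrams.items():
        left[w1] += c
        right[w2] += c
    n = sum(bigrams.values())
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue
        k11 = c12                         # w1 followed by w2
        k12 = left[w1] - c12              # w1 followed by something else
        k21 = right[w2] - c12             # something else followed by w2
        k22 = n - k11 - k12 - k21         # neither in its position
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```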

Upvotes: 1
