unixsnob

Reputation: 1715

Python NLTK tokenizing text using already found bigrams

Background: I have a lot of text that contains technical expressions, which are not always standard.

I know how to find the bigrams and filter them.
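For reference, a minimal sketch of that step using NLTK's collocation finder (the `tokens` variable is a placeholder for my tokenized corpus, and the thresholds are illustrative):

```python
# Minimal sketch of the bigram finding/filtering I already do;
# `tokens` stands in for my tokenized corpus.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                            # drop rare bigrams
best_bigrams = finder.nbest(bigram_measures.pmi, 50)   # top 50 by PMI
```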

Now I want to use them when tokenizing the sentences, so that words which should stay together (according to the calculated bigrams) are kept together.

I would like to know if there is a correct way of doing this within NLTK. If not, I can think of various inefficient ways of rejoining all the broken words by checking dictionaries.

Upvotes: 0

Views: 388

Answers (1)

Everst

Reputation: 376

The way topic modelers usually pre-process text with n-grams is to join them with an underscore (say, topic_modeling or white_house). You can do that when identifying the bigrams themselves. And don't forget to make sure that your tokenizer does not split on underscores (Mallet does if you don't set token-regex explicitly).
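For instance, if you already have your bigrams as word-pair tuples, NLTK's MWETokenizer can do the underscore joining at tokenization time (a rough sketch; `found_bigrams` here is just an illustrative list standing in for your collocation results):

```python
# Rough sketch: re-join already-found bigrams with an underscore
# while tokenizing; `found_bigrams` is assumed to come from your
# collocation-finding step.
from nltk.tokenize import MWETokenizer, word_tokenize

found_bigrams = [('white', 'house'), ('topic', 'modeling')]
tokenizer = MWETokenizer(found_bigrams, separator='_')
tokenizer.tokenize(word_tokenize("the white house discussed topic modeling"))
# -> ['the', 'white_house', 'discussed', 'topic_modeling']
```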

P.S. NLTK's native bigram collocation finder is super slow; if you want something more efficient, look around if you haven't yet, or create your own based on, say, Dunning (1993).
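If it helps, here is a rough self-contained sketch of such a scorer, using plain Counter counts and Dunning's (1993) log-likelihood ratio (the token list and frequency threshold are illustrative):

```python
# Rough sketch of a Counter-based bigram scorer using Dunning's (1993)
# log-likelihood ratio; `tokens` and `min_count` are illustrative.
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    def h(*ks):                       # sum of k*log(k/total), skipping zeros
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def score_bigrams(tokens, min_count=3):
    bigrams = Counter(zip(tokens, tokens[1:]))
    left, right = Counter(), Counter()    # marginal counts of w1 and w2
    for (w1, w2), c in bigrams.items():
        left[w1] += c
        right[w2] += c
    n = sum(bigrams.values())
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue
        k11 = c12                         # w1 followed by w2
        k12 = left[w1] - c12              # w1 followed by something else
        k21 = right[w2] - c12             # something else followed by w2
        k22 = n - k11 - k12 - k21         # neither in its position
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```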

Upvotes: 1
