Reputation: 1715
Background: I have a lot of text that contains technical expressions, which are not always standard.
I know how to find the bigrams and filter them.
Now, I want to use them when tokenizing the sentences. So words that should stay together (according to the calculated bigrams) are kept together.
I would like to know if there is a proper way of doing this within NLTK. If not, I can think of various inefficient ways of rejoining the broken words by checking dictionaries.
Upvotes: 0
Views: 388
Reputation: 376
The way topic modelers usually pre-process text with n-grams is to connect them with an underscore (say, topic_modeling or white_house). You can do that when identifying the bigrams themselves. And don't forget to make sure that your tokenizer does not split on underscores (Mallet does unless you set token-regex explicitly).
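Within NLTK itself, one way to keep known bigrams together at tokenization time is `nltk.tokenize.MWETokenizer`, which merges multi-word expressions using a separator. A minimal sketch, assuming you already have your list of bigrams (the example bigrams here are just placeholders):

```python
from nltk.tokenize import MWETokenizer

# Bigrams you identified earlier (placeholder examples)
bigrams = [("white", "house"), ("topic", "modeling")]

# MWETokenizer merges matching token sequences with the given separator
tokenizer = MWETokenizer(bigrams, separator="_")

words = "the white house announced topic modeling results".split()
tokens = tokenizer.tokenize(words)
# tokens -> ['the', 'white_house', 'announced', 'topic_modeling', 'results']
```

Note that `MWETokenizer.tokenize` operates on an already-tokenized list of words, so you run it after your base word tokenizer.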
P.S. NLTK's native bigram collocation finder is quite slow - if you want something more efficient, look around if you haven't yet, or create your own based on, say, Dunning (1993).
Upvotes: 1