rek

Reputation: 187

Capture bigram topics instead of unigrams using latent dirichlet allocat

I am trying to do the same thing as in this question:

LDA Original Output

Uni-grams

    topic1 -scuba,water,vapor,diving

    topic2 -dioxide,plants,green,carbon

Required Output

Bi-gram topics

    topic1 -scuba diving,water vapor

    topic2 -green plants,carbon dioxide

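That question has this answer:

from nltk.util import ngrams

# Appends the bigrams to each document's existing unigrams,
# so every document ends up containing both.
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]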

What should I change so that the documents contain only bigrams?

Upvotes: 1

Views: 394

Answers (1)

roddar92

Reputation: 366

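Build each document from its bigrams only, dropping the unigrams:

from nltk.util import ngrams

# Assumes docs is a dict mapping a document id to its list of tokens.
for doc in docs:
    docs[doc] = ["_".join(w) for w in ngrams(docs[doc], 2)]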

Or use NLTK's dedicated bigrams function:

from nltk.util import bigrams

for doc in docs:
    docs[doc] = ["_".join(w) for w in bigrams(docs[doc])]

Then pass these lists of bigrams as your texts to all later steps.
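To make the result concrete, here is a minimal sketch of the same transformation in plain Python, without NLTK. The sample docs dict and the to_bigrams helper are made up for illustration, and zip(tokens, tokens[1:]) stands in for nltk.util.bigrams, which yields the same adjacent pairs.

```python
# Hypothetical sample data: document id -> list of tokens.
docs = {
    "d1": ["scuba", "diving", "water", "vapor"],
    "d2": ["green", "plants", "carbon", "dioxide"],
}


def to_bigrams(tokens):
    """Return every pair of adjacent tokens, joined with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]


# Build a new dict rather than mutating docs, so the unigrams stay available.
bigram_docs = {doc_id: to_bigrams(tokens) for doc_id, tokens in docs.items()}

print(bigram_docs["d1"])  # ['scuba_diving', 'diving_water', 'water_vapor']
```

Note that every adjacent pair is kept, so pairs such as diving_water show up next to scuba_diving, and a document with fewer than two tokens becomes an empty list.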

Upvotes: 2
