Reputation: 41
I'm using LDA from gensim for topic modeling. I know how to convert the raw text into a corpus and get the topics. But once I have the topics, can I label the raw documents with them, that is, add the topic results back to the raw documents?
Here is my code:
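import pandas as pd
from nltk.corpus import stopwords  # assumption: stopwords is NLTK's stop-word corpus

movie_reviews = pd.read_csv(data_path + 'movie_review.tsv', header=0, delimiter='\t', quoting=3)
reviews = []
# review_to_words is my own helper that cleans and tokenizes one review
for i in range(len(movie_reviews['review'])):
    reviews.append(review_to_words(movie_reviews['review'][i], stops=stopwords.words('english')))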
from gensim import corpora
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(review) for review in reviews]
from gensim import models
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lda = lda[corpus_tfidf]
lda.print_topics(10)
[(0,
u'0.001*ben + 0.001*sinatra + 0.001*santa + 0.001*henry + 0.001*band + 0.001*william + 0.001*fool + 0.001*tragic + 0.001*favourite + 0.001*bed'),
(1,
u'0.002*dentist + 0.002*homeless + 0.002*connery + 0.002*hawn + 0.002*judas + 0.002*bogus + 0.001*dickens + 0.001*hilarity + 0.001*snuff + 0.001*chong'),
(2,
u'0.002*freddy + 0.002*mst + 0.001*summary + 0.001*aliens + 0.001*fred + 0.001*broke + 0.001*express + 0.001*cube + 0.001*perfection + 0.001*struck'),
(3,
u'0.004*ned + 0.003*kidman + 0.002*nicole + 0.002*chuck + 0.002*hart + 0.002*sabrina + 0.002*miyazaki + 0.002*roberts + 0.002*amitabh + 0.001*educational'),
(4,
u'0.002*seagal + 0.002*buffy + 0.002*caprica + 0.002*stargate + 0.002*clown + 0.002*travolta + 0.001*bsg + 0.001*goat + 0.001*insomnia + 0.001*update'),
(5,
u'0.003*cinderella + 0.002*envy + 0.002*homicide + 0.002*sucker + 0.002*quantum + 0.002*stallone + 0.002*elvira + 0.002*walt + 0.002*lundgren + 0.001*boobs'),
(6,
u'0.002*pickford + 0.002*guaranteed + 0.002*swearing + 0.002*eleniak + 0.002*biko + 0.002*tremendously + 0.001*characterisation + 0.001*arnie + 0.001*radical + 0.001*generate'),
(7,
u'0.003*sandler + 0.002*dont + 0.002*buff + 0.002*ustinov + 0.002*brosnan + 0.001*amazon + 0.001*perry + 0.001*link + 0.001*maker + 0.001*adam'),
(8,
u'0.002*gandhi + 0.002*scarecrow + 0.002*frankie + 0.002*boxing + 0.002*creep + 0.002*worms + 0.002*mcqueen + 0.002*sellers + 0.002*duchovny + 0.002*appearances'),
(9,
u'0.002*sentinel + 0.002*scrooge + 0.002*che + 0.002*robots + 0.002*betty + 0.002*wtf + 0.002*redneck + 0.002*unexplained + 0.002*stiller + 0.002*groups')]
print(corpus_lda[0])
[(1, 0.032862717742657352), (2, 0.061544456899498043), (3, 0.17498689066920223), (5, 0.034931340026756269), (6, 0.01142214861116901), (7, 0.01368447078032208), (8, 0.014051012107502465), (9, 0.58954345105937356)]
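The last line shows the topic distribution for the first document (index 0). Topics with very low weight, such as topics 0 and 4 here, are left out of the list. Now, my question is: how can I convert this into one numeric variable per topic, holding that topic's weight for each document?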
Desired output as a DataFrame:
Document ID    Topic1    Topic2    Topic3    ...
0              0.032     0.062     0.175     ...
As you can see, this is a DataFrame with one column per topic and the topic weight as the value.
Also, can I link these topic variables back to the raw documents, which is movie_reviews here?
Upvotes: 4
Reputation: 877
I was facing the same problem. A little bit of searching resulted in the code below:
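import gensim
import pandas as pd
# Pass the corpus the model was trained on (corpus_tfidf in the question), not lda[corpus_tfidf]
all_topics = lda.get_document_topics(corpus_tfidf, minimum_probability=0.0)
# corpus2csc gives a topics-by-documents sparse matrix, so transpose it to one row per document
all_topics_csc = gensim.matutils.corpus2csc(all_topics)
all_topics_numpy = all_topics_csc.T.toarray()
all_topics_df = pd.DataFrame(all_topics_numpy)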
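For the second part of the question (linking the weights back to the raw reviews), one option is to give the topic table the same index as movie_reviews and concatenate the two column-wise. The snippet below is only a minimal sketch: the tiny movie_reviews frame, the doc_topics lists, the to_dense_rows helper and the dominant_topic column are made-up stand-ins for illustration, not part of gensim. It assumes the per-document topics come back in the same order as the rows of movie_reviews, which holds when the corpus was built by iterating over that column.

```python
import pandas as pd

# Hypothetical stand-ins for the question's objects: a tiny `movie_reviews`
# frame and per-document topic distributions in the shape gensim returns,
# i.e. lists of (topic_id, weight) pairs that may omit low-weight topics.
movie_reviews = pd.DataFrame({"id": ["r1", "r2"],
                              "review": ["first text", "second text"]})
doc_topics = [
    [(1, 0.033), (2, 0.062), (3, 0.175), (9, 0.590)],
    [(0, 0.700), (4, 0.300)],
]
num_topics = 10


def to_dense_rows(doc_topics, num_topics):
    """Expand sparse (topic_id, weight) lists into fixed-length rows."""
    rows = []
    for topics in doc_topics:
        row = [0.0] * num_topics  # topics missing from the list get weight 0
        for topic_id, weight in topics:
            row[topic_id] = weight
        rows.append(row)
    return rows


topic_df = pd.DataFrame(
    to_dense_rows(doc_topics, num_topics),
    columns=["Topic%d" % k for k in range(num_topics)],
    index=movie_reviews.index,  # row i describes review i
)

# Column-wise concat attaches the topic weights to the raw documents.
labeled = pd.concat([movie_reviews, topic_df], axis=1)

# Optionally record the single highest-weight topic per document.
labeled["dominant_topic"] = topic_df.idxmax(axis=1)
```

With the real model, doc_topics would be the list of per-document results (for example list(lda[corpus_tfidf])) and num_topics the value passed to LdaModel. The column names use gensim's 0-based topic ids so they match the ids shown by print_topics.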
References:
Efficient transformation of gensim TransformedCorpus data to array
How to get a complete topic distribution for a document using gensim LDA?
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb
Upvotes: 7