Python gensim LDA: add the topic to the document after getting the topics

Question

I'm using LDA from gensim to perform a topic modeling. I know how to convert the raw text data to corpus and get the topics. But, after I get the topics, can I label or add back the topics results to the raw document?

Here are my codes:

movie_reviews = pd.read_csv(data_path + 'movie_review.tsv',header=0,delimiter='	',quoting=3) 

reviews = []
for i in range(len(movie_reviews['review'])):
reviews.append(review_to_words(movie_reviews['review']
              [i],stops=stopwords.words('english')))
from gensim import corpora
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(review) for review in reviews]
from gensim import models
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lda = lda[corpus_tfidf]
lda.print_topics(10)
[(0,
  u'0.001*ben + 0.001*sinatra + 0.001*santa + 0.001*henry + 0.001*band + 0.001*william + 0.001*fool + 0.001*tragic + 0.001*favourite + 0.001*bed'),
 (1,
  u'0.002*dentist + 0.002*homeless + 0.002*connery + 0.002*hawn + 0.002*judas + 0.002*bogus + 0.001*dickens + 0.001*hilarity + 0.001*snuff + 0.001*chong'),
 (2,
  u'0.002*freddy + 0.002*mst + 0.001*summary + 0.001*aliens + 0.001*fred + 0.001*broke + 0.001*express + 0.001*cube + 0.001*perfection + 0.001*struck'),
 (3,
  u'0.004*ned + 0.003*kidman + 0.002*nicole + 0.002*chuck + 0.002*hart + 0.002*sabrina + 0.002*miyazaki + 0.002*roberts + 0.002*amitabh + 0.001*educational'),
 (4,
  u'0.002*seagal + 0.002*buffy + 0.002*caprica + 0.002*stargate + 0.002*clown + 0.002*travolta + 0.001*bsg + 0.001*goat + 0.001*insomnia + 0.001*update'),
 (5,
  u'0.003*cinderella + 0.002*envy + 0.002*homicide + 0.002*sucker + 0.002*quantum + 0.002*stallone + 0.002*elvira + 0.002*walt + 0.002*lundgren + 0.001*boobs'),
 (6,
  u'0.002*pickford + 0.002*guaranteed + 0.002*swearing + 0.002*eleniak + 0.002*biko + 0.002*tremendously + 0.001*characterisation + 0.001*arnie + 0.001*radical + 0.001*generate'),
 (7,
  u'0.003*sandler + 0.002*dont + 0.002*buff + 0.002*ustinov + 0.002*brosnan + 0.001*amazon + 0.001*perry + 0.001*link + 0.001*maker + 0.001*adam'),
 (8,
  u'0.002*gandhi + 0.002*scarecrow + 0.002*frankie + 0.002*boxing + 0.002*creep + 0.002*worms + 0.002*mcqueen + 0.002*sellers + 0.002*duchovny + 0.002*appearances'),
 (9,
  u'0.002*sentinel + 0.002*scrooge + 0.002*che + 0.002*robots + 0.002*betty + 0.002*wtf + 0.002*redneck + 0.002*unexplained + 0.002*stiller + 0.002*groups') 

print corpus_lda[0]]
[(1, 0.032862717742657352), (2, 0.061544456899498043), (3, 0.17498689066920223), (5, 0.034931340026756269), (6, 0.01142214861116901), (7, 0.01368447078032208), (8, 0.014051012107502465), (9, 0.58954345105937356)]

The last code shows the distribution of each topic in document 1. Now, my question is: how can I convert it into a numeric variable for each topic with its weight for each document?

Desired output in Dataframe:

Document ID  Topic1   Topic2   Topic3.... 
0           0.032    0.062   0.175

As you can see, this is a DataFrame with the topic as column name and weight as the value.

Also, can I link this topic variable back to the raw document, which is movie_review here?

Python gensim LDA: add the topic to the document after getting the topics

Answers (1)

Related Questions