This is an example of latent semantic analysis (LSA). For simplicity, I consider 4 documents and 2 topics. The code I used is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd
body = [
'the quick brown fox',
'the slow brown dog',
'the quick red dog',
'the lazy yellow fox'
]
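# Plain term frequencies (no IDF weighting), L1-normalised per document
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)
# Project the document-term matrix onto 2 latent components ("topics")
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)
topic_encoded_df = pd.DataFrame(
    lsa,
    index=['text_1', 'text_2', 'text_3', 'text_4'],
    columns=['topic_1', 'topic_2'],
)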
The resulting topic_encoded_df is the following data frame:
         topic_1   topic_2
text_1  0.423726  0.074881
text_2  0.378963 -0.192278
text_3  0.378963 -0.192278
text_4  0.316547  0.360146
Again, this is a trivial case, meant only to illustrate what I did.
Is there a way to say, coherently and meaningfully, whether text_1 is about topic_2, for example? Or whether text_2 is about topic_2? I was thinking of something like the elbow method applied to the column values (sorted in descending order), but I'm afraid the negative signs may give misleading indications. Does anyone have another idea?
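To make the elbow idea concrete, here is a minimal sketch of one way it could be applied to the topic_2 column. It rests on an assumption that is not an established rule: the scores are ranked by absolute value, so the sign is ignored (the sign of an SVD component is arbitrary, but whether magnitude alone means "is about" remains a judgment call). The helper name largest_gap_split is hypothetical, and the "elbow" is simply taken to be the largest drop between consecutive sorted magnitudes.

```python
# Values of the topic_2 column, copied from the data frame above.
topic_2 = {
    'text_1': 0.074881,
    'text_2': -0.192278,
    'text_3': -0.192278,
    'text_4': 0.360146,
}


def largest_gap_split(scores):
    """Hypothetical helper: rank documents by |score| (descending) and
    keep those that sit above the largest drop between neighbours."""
    # Assumption: only the magnitude matters, so negative signs are dropped.
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    magnitudes = [abs(value) for _, value in ranked]
    # Drop between each pair of consecutive magnitudes.
    gaps = [a - b for a, b in zip(magnitudes, magnitudes[1:])]
    cut = gaps.index(max(gaps))  # position of the largest drop
    return [name for name, _ in ranked[:cut + 1]]


about_topic_2 = largest_gap_split(topic_2)
print(about_topic_2)  # ['text_4']
```

With only four documents the cut is very sensitive to individual values, so this is only meant to show the mechanics of the approach, not to validate it.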