Use latent semantic analysis to understand if a document is about a topic

Question

This is an example of latent semantic analysis (LSA). For simplicity, I consider 4 documents and 2 topics. The code I used is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd

# Four toy documents
body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# Term frequencies only (no IDF weighting), L1-normalised per document
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)

# Project the document-term matrix onto 2 latent topics
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

# One row per document, one column per topic
topic_encoded_df = pd.DataFrame(
    lsa,
    index=['text_1', 'text_2', 'text_3', 'text_4'],
    columns=['topic_1', 'topic_2']
)

The resulting data frame topic_encoded_df is:

        topic_1     topic_2
text_1  0.423726    0.074881
text_2  0.378963    -0.192278
text_3  0.378963    -0.192278
text_4  0.316547    0.360146

Again, this is a trivial case, meant only to show what I did. Is there a way to say coherently and meaningfully whether, for example, text_1 is about topic_2? Or whether text_2 is about topic_2? I was thinking of something like the elbow method on the column values (sorted in descending order), but I'm afraid the negative signs may give wrong indications. Does anyone have another idea?
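
To make the elbow idea concrete, here is a minimal sketch that ranks documents by the absolute value of their loading on a topic and cuts at the largest drop. Taking absolute values is an assumption on my part: the sign of an SVD component is arbitrary, so magnitude is one way to sidestep the negative-sign concern, but it treats "strongly negative" as equally "about" the topic. The helper docs_above_elbow is hypothetical, and the numbers are copied from the table above.

```python
import numpy as np
import pandas as pd

# Loadings copied from the table above (rows: documents, columns: topics)
topic_encoded_df = pd.DataFrame(
    [[0.423726, 0.074881],
     [0.378963, -0.192278],
     [0.378963, -0.192278],
     [0.316547, 0.360146]],
    index=['text_1', 'text_2', 'text_3', 'text_4'],
    columns=['topic_1', 'topic_2'],
)

def docs_above_elbow(df, topic):
    """Hypothetical helper: documents ranked before the largest gap in |loading|."""
    # Sort documents by magnitude of their loading, largest first
    ranked = df[topic].abs().sort_values(ascending=False)
    # Drop in magnitude between each pair of consecutive ranks
    gaps = -np.diff(ranked.to_numpy())
    # Keep everything before the biggest drop
    cut = int(np.argmax(gaps)) + 1
    return list(ranked.index[:cut])

about_topic_1 = docs_above_elbow(topic_encoded_df, 'topic_1')
about_topic_2 = docs_above_elbow(topic_encoded_df, 'topic_2')
print(about_topic_1)
print(about_topic_2)
```

With only four documents the largest gap is a crude criterion, so this is better read as an illustration of the mechanics than as a recommendation.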

Answers (0)