LJG

Reputation: 767

Use latent semantic analysis to understand if a document is about a topic

This is an example of the use of latent semantic analysis. For simplicity I have considered 4 documents and 2 topics. The code I used is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

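# Term frequencies only (no IDF), with each document row L1-normalized
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)

# Project the documents onto 2 latent components ("topics")
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

# One row per document, one column per component
topic_encoded_df = pd.DataFrame(
    lsa,
    index=['text_1', 'text_2', 'text_3', 'text_4'],
    columns=['topic_1', 'topic_2']
)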

topic_encoded_df is the following data frame:

        topic_1     topic_2
text_1  0.423726    0.074881
text_2  0.378963    -0.192278
text_3  0.378963    -0.192278
text_4  0.316547    0.360146

Again, this is a trivial case, meant only to show what I did. Is there a coherent, meaningful way to say whether, for example, text_1 is about topic_2? Or whether text_2 is about topic_2? I was thinking of something like the elbow method applied to the column values (sorted in descending order), but I'm afraid the negative signs may give misleading indications. Does anyone have another idea?
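
To make the sign concern concrete, here is a minimal sketch of one possible approach, not a definitive answer. The sign of an SVD component is arbitrary (flipping a component and its loadings gives an equally valid decomposition), so the sketch compares magnitudes only and turns each row into per-topic shares. The scores are copied from the data frame above; the helper `topic_shares` and the cut-off `THRESHOLD = 0.4` are assumptions introduced purely for illustration.

```python
# Document-topic scores copied from the data frame in the question.
scores = {
    'text_1': [0.423726, 0.074881],
    'text_2': [0.378963, -0.192278],
    'text_3': [0.378963, -0.192278],
    'text_4': [0.316547, 0.360146],
}

def topic_shares(row):
    """Return each topic's share of the row's total absolute weight.

    Hypothetical helper: it discards the sign, on the assumption that
    only the magnitude of a loading matters for "aboutness".
    """
    magnitudes = [abs(value) for value in row]
    total = sum(magnitudes)
    if total == 0:
        return [0.0 for _ in row]
    return [m / total for m in magnitudes]

shares = {doc: topic_shares(row) for doc, row in scores.items()}

# Assumed cut-off, chosen only for this illustration.
THRESHOLD = 0.4

# A document is labelled with every topic whose share reaches the cut-off.
about = {
    doc: [f'topic_{i + 1}' for i, share in enumerate(row) if share >= THRESHOLD]
    for doc, row in shares.items()
}

for doc in scores:
    rounded = [round(share, 3) for share in shares[doc]]
    print(doc, rounded, about[doc])
```

With these numbers, text_1 is dominated by topic_1, while text_4 splits almost evenly between the two. Whether dropping the sign is acceptable depends on the application, and the threshold would need to be tuned on real data.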

Upvotes: 0

Views: 372

Answers (0)
