Alex16237

Reputation: 199

Do I use TF-IDF correctly over a corpus of raw documents?

I am in doubt as to whether I use my TF-IDF calculations correctly. I have a large corpus of different documents; each document is stored in its own row of a pandas DataFrame. I feed each row to scikit-learn's TfidfVectorizer and store the feature names (words) in a list.

I am using the following code:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

term_tdidf = []

def tdidf_f(vec, matrix):
    f_array = np.array(vec.get_feature_names())
    t_sort = np.argsort(matrix.toarray()).flatten()[::-1]
    n = 100
    top_term = f_array[t_sort][:n]
    term_tdidf.append(set(top_term))

for row in df.document:
    x = TfidfVectorizer(stop_words='english')
    tfidf_matrix = x.fit_transform(row)
    terms = x.get_feature_names()
    tdidf_f(x, tfidf_matrix)

After that I create a new DataFrame where the set of TF-IDF terms from each document is stored in a separate column.

Is that a correct use of TF-IDF? I am running it on only a single document at a time, so the terms I am getting are calculated within that one document only, correct? As I understand it, TF-IDF should be computed across all documents to find one set of frequent terms, not multiple sets. Are there any consequences of such an application?

My manual review of the extracted features from each document indicates that the terms I am getting are fitting. Afterwards I use those terms to calculate similarity between documents, and it seems to be correct.

Upvotes: 0

Views: 245

Answers (1)

Stanislas Morbieu

Reputation: 1827

To compute the IDF part of the weighting, you need to count the number of documents in the whole corpus that contain the term, so your code is incorrect: fitting a separate vectorizer per document gives each term an IDF based on that one document alone.

Here is a minimal example of how to use it:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["My first document", "My second document"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

X is then a matrix where each row is a document and each column represents a term.

Upvotes: 1
