Daniel

Reputation: 263

How to select text data based on a benchmark using TF-IDF weighted Jaccard similarity?

I have a set of benchmark articles with title, abstract and DOI that I want to use as a "training model" to select articles from a bigger corpus. For the first iteration, I only want to use the benchmark titles to create the TF-IDF model, which I can then use to select articles based on the corpus titles. Selection uses TF-IDF weighted Jaccard similarity: articles scoring above a threshold of 0.5 are selected for further reading.

The code is below, but first I have an extra question:

Should the TF-IDF model be "trained" taking all benchmark['title_token'] as a single set?

The code below does not compute the TF-IDF weighted Jaccard similarity. Instead it fails with ValueError: setting an array element with a sequence.

import os
import gc
import re
import nltk
from nltk.corpus import stopwords
import string
import numpy as np
import pandas as pd
from tqdm import tqdm
import dask.dataframe as dd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

""" Step 1: Prepare benchmark and corpus articles """

benchmark['title_token'] = benchmark['title_tokenizade']
corpora['title_token'] = corpora['title_tokenizade']


""" Step 2: Vectorize Text Data, apply Jaccard Index and TF-IDF """

# Calculate TF-IDF weights for benchmark articles
tfidf_vectorizer_benchmark = TfidfVectorizer()
tfidf_matrix_benchmark = tfidf_vectorizer_benchmark.fit_transform(benchmark['title_token'])

# Calculate TF-IDF weights for corpus articles
tfidf_vectorizer_corpus = TfidfVectorizer(vocabulary=tfidf_vectorizer_benchmark.vocabulary_)  # Use the same vocabulary as benchmark
tfidf_matrix_corpus = tfidf_vectorizer_corpus.fit_transform(corpora['title_token'])

# Define threshold for similarity score
threshold = 0.5  # Adjust as needed

# Compute TF-IDF weighted Jaccard similarity
num_benchmark = len(benchmark)
num_corpus = len(corpora)
similarity_scores = np.zeros((num_benchmark, num_corpus))

for i, benchmark_row in enumerate(tfidf_matrix_benchmark):
    for j, corpus_row in enumerate(tfidf_matrix_corpus):
       
        # Convert TF-IDF vectors to binary arrays
        benchmark_binary = (benchmark_row > 0).astype(int)
        corpus_binary = (corpus_row > 0).astype(int)       
        
        # Print shapes for debugging
        print(f"Shape of benchmark_binary: {benchmark_binary.shape}")
        print(f"Shape of corpus_binary: {corpus_binary.shape}")
        
        # Calculate Jaccard similarity
        similarity_scores[i, j] = jaccard_score(benchmark_binary, corpus_binary, average=None)
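
For comparison, here is a minimal library-free sketch of what a TF-IDF weighted Jaccard could look like, assuming the common definition sum(min(a_k, b_k)) / sum(max(a_k, b_k)) over token weights. The helper names (fit_idf, tfidf, weighted_jaccard) and the toy titles are hypothetical. The IDF formula mirrors scikit-learn's smoothed default but skips its L2 normalisation, and fitting the IDF once on all benchmark titles is an assumption, not a confirmed answer to the question above. The ValueError in the loop above is probably caused by jaccard_score with average=None returning one score per label rather than a single number.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed IDF per token, fitted on the benchmark titles only (assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    return {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}

def tfidf(doc, idf):
    """TF-IDF weights for one title; tokens outside the fitted vocabulary are dropped."""
    tf = Counter(tok for tok in doc.split() if tok in idf)
    return {tok: cnt * idf[tok] for tok, cnt in tf.items()}

def weighted_jaccard(a, b):
    """sum(min) / sum(max) over the union of tokens; 0.0 if both vectors are empty."""
    keys = set(a) | set(b)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    if den == 0:
        return 0.0
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den

# Toy data standing in for benchmark['title_token'] and corpora['title_token']
benchmark_titles = ["deep learning for protein folding",
                    "graph neural networks for chemistry"]
corpus_titles = ["deep learning for protein folding",
                 "medieval trade routes in europe",
                 "neural networks for chemistry"]

idf = fit_idf(benchmark_titles)                       # fitted once, reused for the corpus
bench_vecs = [tfidf(t, idf) for t in benchmark_titles]
corpus_vecs = [tfidf(t, idf) for t in corpus_titles]

# scores[i][j] = similarity between benchmark title i and corpus title j
scores = [[weighted_jaccard(b, c) for c in corpus_vecs] for b in bench_vecs]

threshold = 0.5
# Keep a corpus title if it is similar enough to at least one benchmark title
selected = [title for j, title in enumerate(corpus_titles)
            if max(row[j] for row in scores) > threshold]
print(selected)
```

Each pair yields a single float, so the result can be stored in one cell of a (num_benchmark, num_corpus) array without the shape error.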

Answers (0)
