saving_space

Reputation: 168

AI Based Deduplication using Textual Similarity Measure in Python

I have a dataframe that contains rows like this:

ID    Title                                  Abstract                                            Keywords       Author        Year
5875  Textual Similarity: A Review           Textual Similarity has been used for measuring ...  X, Y, Z        James Thomas  2018
8596  Natural Language Processing: A Review  Natural Language Processing has been used for ...   NLP, AI, BERT  Rami John     2015
4586  Textual Similarity: Systematic Review  Text Similarity is being used for                   Y, Z, AI       J Thomas      2018

I would like to write a function deduplicate that takes the dataframe and returns a matrix that lets me compare the records with each other.

def deduplicate(df):
    # Placeholder: take in each row and compute a similarity matrix
    matrix = ...
    return matrix

The matrix could look like this:

ID    5875  8596  4586
5875  1     0.4   0.9
8596  0.4   1     0.5
4586  0.9   0.5   1

This would let me find which records are similar to each other. I think I need some NLP models here, as the rows contain textual as well as numerical data.

Is there a way to do this in Python? Some people suggest using dedupe, but due to the privacy rules in place at my organization, we can only use in-house solutions for this. Any suggestions would be welcome.

Upvotes: 1

Views: 887

Answers (1)

Meti

Reputation: 2056

The easiest way to improve your comparison is to use TF-IDF (comprehensive explanation here).

One of the main weaknesses of the fuzzywuzzy package is that it ignores the importance of each part of a string (subtoken, token, 2-gram, and so on). For example, two documents that both contain the word Unicorn are most likely more similar to each other than two documents that both contain the word USA, because Unicorn is much rarer overall. This is where a handy tool named TF-IDF comes into play: it weights each feature (word n-gram or character n-gram) when measuring similarity, so rare features count for more. Moreover, it's easy to use thanks to the sklearn library.
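To make the Unicorn vs. USA point concrete, here is a minimal sketch that fits a TfidfVectorizer on a tiny made-up corpus and compares the learned IDF weights. The four sentences and the variable names are assumptions for illustration only, not data from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "usa" occurs in every document, "unicorn" in only one.
docs = [
    "a unicorn was seen in the usa",
    "the usa has fifty states",
    "the usa borders canada",
    "elections in the usa",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# vocabulary_ maps each term to its column index; idf_ holds the learned weights.
idf_unicorn = vectorizer.idf_[vectorizer.vocabulary_["unicorn"]]
idf_usa = vectorizer.idf_[vectorizer.vocabulary_["usa"]]

print("idf(unicorn) =", idf_unicorn)
print("idf(usa)     =", idf_usa)
```

The rare term gets the larger weight, so a shared Unicorn contributes more to the similarity of two documents than a shared USA.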

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Your corpus
corpus = ['The sun is the largest celestial body in the solar system',
          'The solar system consists of the sun and eight revolving planets',
          'Ra was the Egyptian Sun God',
          'The Pyramids were the pinnacle of Egyptian architecture',
          'The quick brown fox jumps over the lazy dog']

# Initialize an instance of the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
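Applied to the dataframe from the question, the same approach could be wrapped in a deduplicate function along these lines. This is only a sketch under a few assumptions: the text columns are simply concatenated into one string per record, the numerical Year column is left out, and the parameter names text_columns and id_column are made up for illustration. The sample rows are copied from the question, including the truncated abstracts.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def deduplicate(df, text_columns=("Title", "Abstract", "Keywords", "Author"), id_column="ID"):
    # Assumption: one combined string per record is enough; Year is ignored here.
    combined = (
        df[list(text_columns)]
        .fillna("")
        .astype(str)
        .apply(lambda row: " ".join(row), axis=1)
    )
    tfidf_matrix = TfidfVectorizer().fit_transform(combined)
    similarity = cosine_similarity(tfidf_matrix)
    # Label rows and columns with the record IDs so pairs are easy to look up.
    return pd.DataFrame(similarity, index=df[id_column], columns=df[id_column])


# Sample rows as given in the question (abstracts are truncated there too).
df = pd.DataFrame({
    "ID": [5875, 8596, 4586],
    "Title": [
        "Textual Similarity: A Review",
        "Natural Language Processing: A Review",
        "Textual Similarity: Systematic Review",
    ],
    "Abstract": [
        "Textual Similarity has been used for measuring ...",
        "Natural Language Processing has been used for ...",
        "Text Similarity is being used for",
    ],
    "Keywords": ["X, Y, Z", "NLP, AI, BERT", "Y, Z, AI"],
    "Author": ["James Thomas", "Rami John", "J Thomas"],
    "Year": [2018, 2015, 2018],
})

matrix = deduplicate(df)
print(matrix.round(2))
```

Records 5875 and 4586 should score higher with each other than either does with 8596. Everything runs locally, which may matter given the in-house constraint. Note that the default tokenizer drops single-character tokens such as X or J, so short keywords and initials do not contribute unless the vectorizer is configured differently.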

There are plenty of more advanced methods you can exploit to improve the result.

Upvotes: 1
