Reputation: 168
I have a dataframe that contains rows like this:
ID | Title | Abstract | Keywords | Author | Year |
---|---|---|---|---|---|
5875 | Textual Similarity: A Review | Textual Similarity has been used for measuring ... | X, Y, Z | James Thomas | 2018 |
8596 | Natural Language Processing: A Review | Natural Language Processing has been used for ... | NLP, AI, BERT | Rami John | 2015 |
4586 | Textual Similarity: Systematic Review | Text Similarity is being used for | Y, Z, AI | J Thomas | 2018 |
I would like to write a function, deduplicate, that ingests the dataframe and outputs a matrix that lets me compare the records with each other.
def deduplicate(df):
    # take in each row and compute a similarity matrix
    matrix = ...
    return matrix
The resulting matrix could look like this:
ID | 5875 | 8596 | 4586 |
---|---|---|---|
5875 | 1 | 0.4 | 0.9 |
8596 | 0.4 | 1 | 0.5 |
4586 | 0.9 | 0.5 | 1 |
This would let me find which records are likely duplicates by comparing their similarity scores. I think I need some NLP models here, since the rows contain textual as well as numerical data.
Is there a way to do this in Python? Some people suggest using dedupe, but because of the privacy laws in place at my organization, we need an in-house solution. Any suggestions would be welcome.
Upvotes: 1
Views: 887
Reputation: 2056
The easiest way to improve your comparison is to use TF-IDF (comprehensive explanation here).
One of the main weaknesses of the fuzzy-wuzzy package is that it ignores the importance of each string component (subtoken, token, 2-gram, and so on). For example, two documents that contain the word Unicorn are most probably more similar to each other than two documents that contain the word USA, because Unicorn is far rarer overall. This is where a handy tool named TF-IDF comes into play: it weights each term (word n-gram or character n-gram) by how rare it is across the corpus when measuring similarity. Moreover, it is easy to use thanks to the sklearn library.
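To make the weighting concrete, here is a small sketch that fits scikit-learn's `TfidfVectorizer` and compares the learned inverse document frequency (IDF) of a common word with that of a rare one. The toy documents are invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (made up for this example): "usa" is common, "unicorn" is rare.
docs = [
    "usa economy report",
    "usa election news",
    "usa travel guide",
    "unicorn travel guide",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# vocabulary_ maps each term to its column index; idf_ holds one weight per column.
idf = {term: vectorizer.idf_[index] for term, index in vectorizer.vocabulary_.items()}

print(f"idf(usa)     = {idf['usa']:.3f}")
print(f"idf(unicorn) = {idf['unicorn']:.3f}")
```

The rarer term receives the larger weight, which is why a shared rare word pulls two documents closer together than a shared common one.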
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Your corpus
corpus = [
    'The sun is the largest celestial body in the solar system',
    'The solar system consists of the sun and eight revolving planets',
    'Ra was the Egyptian Sun God',
    'The Pyramids were the pinnacle of Egyptian architecture',
    'The quick brown fox jumps over the lazy dog',
]
# Initialize an instance of the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
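Applied to the dataframe in the question, the same approach could be wrapped in the `deduplicate` function roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the column names shown in the question, joins the text columns into one string per record (one simple choice among many), and ignores the numeric `Year` column. The `text_columns` parameter is a hypothetical convenience, not something from the original post.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def deduplicate(df, text_columns=("Title", "Abstract", "Keywords", "Author")):
    """Return an ID-by-ID cosine similarity matrix as a DataFrame.

    Assumption: df has an "ID" column plus the text columns named above.
    """
    # Join the text fields of each record into a single document.
    documents = df[list(text_columns)].fillna("").astype(str).agg(" ".join, axis=1)

    # Vectorize the documents and compare every record with every other one.
    tfidf_matrix = TfidfVectorizer().fit_transform(documents)
    similarity = cosine_similarity(tfidf_matrix)

    # Label rows and columns with the record IDs, as in the desired output.
    ids = df["ID"].to_numpy()
    return pd.DataFrame(similarity, index=ids, columns=ids)


# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [5875, 8596, 4586],
    "Title": [
        "Textual Similarity: A Review",
        "Natural Language Processing: A Review",
        "Textual Similarity: Systematic Review",
    ],
    "Abstract": [
        "Textual Similarity has been used for measuring ...",
        "Natural Language Processing has been used for ...",
        "Text Similarity is being used for",
    ],
    "Keywords": ["X, Y, Z", "NLP, AI, BERT", "Y, Z, AI"],
    "Author": ["James Thomas", "Rami John", "J Thomas"],
    "Year": [2018, 2015, 2018],
})

matrix = deduplicate(df)
print(matrix.round(2))
```

Pairs whose score exceeds a threshold of your choosing can then be flagged as candidate duplicates; the right threshold depends on the data and would need to be checked against known duplicates.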
There are plenty of more advanced methods you can exploit to improve the result.
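One modest refinement, in line with the character n-gram weighting mentioned above, is to build the TF-IDF vectors from character n-grams instead of whole words. This tends to be more forgiving of abbreviations and spelling variants such as `James Thomas` versus `J Thomas`. The sketch below uses scikit-learn's `analyzer="char_wb"` option; the n-gram range of 2 to 3 is an arbitrary starting point, not a tuned value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Author names taken from the question's sample rows.
authors = ["James Thomas", "Rami John", "J Thomas"]

# char_wb builds n-grams from characters inside word boundaries.
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
author_sim = cosine_similarity(char_vectorizer.fit_transform(authors))

print(author_sim.round(2))
```

With these three names, the two Thomas entries should score noticeably closer to each other than either does to the third author.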
Upvotes: 1