AI Based Deduplication using Textual Similarity Measure in Python

Question

I have a dataframe that contains rows like this:

ID	Title	Abstract	Keywords	Author	Year
5875	Textual Similarity: A Review	Textual Similarity has been used for measuring ...	X, Y, Z	James Thomas	2018
8596	Natural Language Processing: A Review	Natural Language Processing has been used for ...	NLP, AI, BERT	Rami John	2015
4586	Textual Similarity: Systematic Review	Text Similarity is being used for	Y, Z, AI	J Thomas	2018

I would like to write a function, deduplicate, that takes the dataframe and returns a matrix I can use to compare the records with each other.

def deduplicate(df):
    # take in each row and compute a pairwise similarity matrix
    matrix = ...
    return matrix

The matrix could look like this:

ID	5875	8596	4586
5875	1	0.4	0.9
8596	0.4	1	0.5
4586	0.9	0.5	1

This would let me find likely duplicates by looking for pairs of records with a high similarity score. I think I need some kind of NLP model here, since the rows contain both textual and numerical data.
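
To make the intended use concrete, here is a minimal sketch of reading candidate duplicates off a matrix like the one above. The helper name candidate_duplicates and the 0.8 cut-off are assumptions for illustration only; a sensible threshold would have to be tuned on real data.

```python
from itertools import combinations

# Similarity matrix from the example above, keyed by record ID.
matrix = {
    5875: {5875: 1.0, 8596: 0.4, 4586: 0.9},
    8596: {5875: 0.4, 8596: 1.0, 4586: 0.5},
    4586: {5875: 0.9, 8596: 0.5, 4586: 1.0},
}


def candidate_duplicates(matrix, threshold=0.8):
    """Return (id_a, id_b, score) for each unordered pair scoring >= threshold.

    `threshold` is a hypothetical default, not a recommended value.
    """
    pairs = []
    for a, b in combinations(matrix, 2):
        score = matrix[a][b]
        if score >= threshold:
            pairs.append((a, b, score))
    # Highest-scoring pairs first, so the likeliest duplicates are reviewed first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


print(candidate_duplicates(matrix))  # [(5875, 4586, 0.9)]
```

With the example values, only records 5875 and 4586 clear the threshold, which matches the intuition that they describe the same paper.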

Is there a way to do this in Python? Some people suggest using dedupe, but because of the privacy rules in place at my organization, we can only use solutions that run in-house. Any suggestions would be welcome.
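
One possible in-house starting point, offered as a rough sketch rather than a definitive answer: compare each text field with a bag-of-words cosine similarity, treat Year as an exact match, and combine the per-field scores with weights. Everything here runs locally with only the standard library. The field weights, the helper names, and the choice to pass records as a list of dicts instead of a dataframe are all assumptions made for illustration. Note that this is plain word overlap, not a trained NLP model, so it will miss paraphrases.

```python
import math
import re
from collections import Counter

# Hypothetical field weights (they sum to 1.0); tune them for your data.
TEXT_WEIGHTS = {"Title": 0.35, "Abstract": 0.30, "Keywords": 0.15, "Author": 0.10}
YEAR_WEIGHT = 0.10


def tokens(text):
    """Lowercased word counts for one text field."""
    return Counter(re.findall(r"[a-z0-9]+", str(text).lower()))


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def record_similarity(r1, r2):
    """Weighted mix of per-field text similarity and an exact match on Year."""
    score = sum(
        w * cosine(tokens(r1[f]), tokens(r2[f])) for f, w in TEXT_WEIGHTS.items()
    )
    score += YEAR_WEIGHT * (1.0 if r1["Year"] == r2["Year"] else 0.0)
    return score


def deduplicate(records):
    """Return a nested dict {id: {id: similarity}} for all pairs of records."""
    return {
        r1["ID"]: {r2["ID"]: round(record_similarity(r1, r2), 2) for r2 in records}
        for r1 in records
    }


records = [
    {"ID": 5875, "Title": "Textual Similarity: A Review",
     "Abstract": "Textual Similarity has been used for measuring ...",
     "Keywords": "X, Y, Z", "Author": "James Thomas", "Year": 2018},
    {"ID": 8596, "Title": "Natural Language Processing: A Review",
     "Abstract": "Natural Language Processing has been used for ...",
     "Keywords": "NLP, AI, BERT", "Author": "Rami John", "Year": 2015},
    {"ID": 4586, "Title": "Textual Similarity: Systematic Review",
     "Abstract": "Text Similarity is being used for",
     "Keywords": "Y, Z, AI", "Author": "J Thomas", "Year": 2018},
]

matrix = deduplicate(records)
for row_id, row in matrix.items():
    print(row_id, row)
```

If the data lives in a pandas dataframe, df.to_dict("records") should produce the list-of-dicts input used here, and pd.DataFrame(matrix) should turn the result back into a table. The scores will not match the illustrative numbers in the question, but 5875 and 4586 should still come out as the closest pair.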
