stephsmith

Reputation: 181

How to outer merge 3 or more datasets based on an id and calculate the similarity between them using another column?

Let's say we have three datasets, one for each of three different years:

ID Text Year
101 abc 1990
102 abd 1990
103 a 1990

And a second dataset, which may or may not contain IDs from the first year:

ID Text Year
104 bc 1991
101 abc 1991
102 abe 1991

And the third dataset:

ID Text Year
104 bc 1992
105 a 1992

I want to merge these three dataframes and add a new column holding the text similarity (using TF-IDF) between common IDs (and uncommon IDs) across those consecutive years, and also update the year and the text whenever the similarity between the text from Year1 and the text from Year2 is above 0.8.

Here is the result I want:

ID Text Year Similarity
101 abc 1991 1
102 abe 1991 TF-IDF('abd', 'abe')
103 a 1990 0
104 bc 1992 1
105 a 1992 0

So I want to include the new IDs from the later years, but also keep the IDs of a previous year that have no ID match in a later one, together with that similarity column. The merge should not be inner (because we also want to keep IDs that are not present in the second/third dataframe), and the year should be updated when the similarity score is above a threshold (say, if the text of ID 104 in year 1991 has > 0.8 similarity with the text of ID 104 in year 1992).
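To illustrate the merge requirement alone: an outer merge keeps IDs present in either dataframe, whereas an inner merge would drop the unmatched ones. A minimal sketch with the first two frames from the example (the `suffixes` names are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": [1990, 1990, 1990]}
)
df2 = pd.DataFrame(
    {"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": [1991, 1991, 1991]}
)

# how="outer" keeps IDs that occur in either frame; an inner merge
# would drop 103 (only in 1990) and 104 (only in 1991).
merged = pd.merge(df1, df2, on="ID", how="outer", suffixes=("_1990", "_1991"))
print(sorted(merged["ID"]))  # [101, 102, 103, 104]
```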

Thanks

Upvotes: 0

Views: 67

Answers (1)

Laurent

Reputation: 13518

With the dataframes you provided:

import pandas as pd

df1 = pd.DataFrame(
    {"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": [1990, 1990, 1990]}
)
df2 = pd.DataFrame(
    {"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": [1991, 1991, 1991]}
)
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": [1992, 1992]})

Here is one way to do it with the Python standard library's difflib module, which provides functions for comparing sequences:

# Setup
from difflib import SequenceMatcher

def ratio(a, b):
    """Return the difflib similarity ratio (0.0-1.0) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

df = pd.concat([df1, df2, df3])
# Add similarity column with 100% matches and remove duplicates
df["Similarity"] = df.duplicated(subset=["ID", "Text"], keep="first").astype(int)
df = df.drop_duplicates(subset=["ID", "Text"], keep="last")

# Calculate ratio for different Text values
tmp = df.assign(
    match=df["Text"].map(lambda x: {value: ratio(x, value) for value in df["Text"]})
)

# Format and filter results
tmp = (
    pd.DataFrame(tmp["match"].to_list(), index=tmp["Text"])
    .reset_index(drop=False)
    .melt("Text", var_name="Other_Text", value_name="ratio")
    .dropna()
)
# Keep pairs of different texts above the similarity threshold
# (0.5 here; the question asked for 0.8)
tmp = tmp[(tmp["Text"] != tmp["Other_Text"]) & (tmp["ratio"] > 0.5)]

# Update similarities
df["Similarity"] = df.apply(
    lambda x: 1 if x["Text"] in tmp["Text"].values else x["Similarity"], axis=1
)

# Add matching values when similarity is not 100%
df = pd.merge(
    left=df,
    right=df.groupby("ID").agg({"Text": list}).reset_index(),
    how="left",
    left_on="ID",
    right_on="ID",
)
df.loc[df["ID"].duplicated(), "Similarity"] = df.loc[df["ID"].duplicated(), "Text_y"]

# Cleanup
df = (
    df.drop_duplicates("ID", keep="last")
    .sort_values("ID", ignore_index=True)
    .drop(columns=["Text_x", "Text_y"])
)
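As a sanity check on the similarity values involved, the ratio for the text pairs in the example can be computed directly with the standard library:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Return the difflib similarity ratio (0.0-1.0) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Identical texts score 1.0; 'abd' vs 'abe' share 2 of 3 characters,
# so the ratio is 2 * 2 / (3 + 3).
print(ratio("abc", "abc"))            # 1.0
print(round(ratio("abd", "abe"), 3))  # 0.667
```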

Then:

print(df)
# Output
    ID  Year  Similarity
0  101  1991           1
1  102  1991  [abd, abe]
2  103  1990           0
3  104  1992           1
4  105  1992           0
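Since the question specifically asked for TF-IDF rather than difflib's ratio, here is a minimal, hedged sketch of a character-level TF-IDF cosine similarity using only the standard library (in practice you would likely use scikit-learn's TfidfVectorizer instead; the smoothed-IDF formula below mirrors its default so that characters shared by both texts are not zeroed out when only two documents are compared):

```python
import math
from collections import Counter

def tfidf_cosine(a: str, b: str) -> float:
    """Character-level TF-IDF cosine similarity between two strings."""
    docs = [Counter(a), Counter(b)]
    vocab = set(a) | set(b)
    # Smoothed IDF (scikit-learn style): ln((1 + N) / (1 + df)) + 1
    idf = {
        ch: math.log((1 + len(docs)) / (1 + sum(ch in d for d in docs))) + 1
        for ch in vocab
    }
    vecs = [{ch: d[ch] * idf[ch] for ch in vocab} for d in docs]
    dot = sum(vecs[0][ch] * vecs[1][ch] for ch in vocab)
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    return dot / (norms[0] * norms[1]) if all(norms) else 0.0

print(round(tfidf_cosine("abc", "abc"), 6))  # 1.0
print(tfidf_cosine("abd", "abe"))            # strictly between 0 and 1
```

This could replace the `ratio` helper in the answer above if a TF-IDF-based score is required.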

Upvotes: 2
