Reputation: 181
Let's say we have three datasets with three different years:
ID | Text | Year |
---|---|---|
101 | abc | 1990 |
102 | abd | 1990 |
103 | a | 1990 |
And a second dataset that may or may not contain IDs from the first year:
ID | Text | Year |
---|---|---|
104 | bc | 1991 |
101 | abc | 1991 |
102 | abe | 1991 |
And the third dataset:
ID | Text | Year |
---|---|---|
104 | bc | 1992 |
105 | a | 1992 |
I want to merge these three dataframes and add a new column holding the text similarity (using TF-IDF) between rows that share an ID across consecutive years (IDs without a match get 0). I also want to update the year and the text whenever the similarity between the texts from Year1 and Year2 is above 0.8.
Here is the result I want:
ID | Text | Year | Similarity |
---|---|---|---|
101 | abc | 1991 | 1 |
102 | abe | 1991 | TF-IDF('abd', 'abe') |
103 | a | 1990 | 0 |
104 | bc | 1992 | 1 |
105 | a | 1992 | 0 |
So I also want to include the new IDs that appear in the later years, while keeping the IDs from a previous year that have no ID match in the next one, together with the similarity column. The merge should not be inner (because we also want to integrate the IDs that are not present in the second/third dataframe), and the year should be updated if the similarity score is above a threshold (say, if the text of ID 104 in year 1991 has > 0.8 similarity with the text of ID 104 in year 1992).
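To make "text similarity (using TF-IDF)" concrete, here is a pure-Python, character-level sketch of what I mean (the names `tfidf_vectors` and `cosine` are just illustrative; the smoothed idf formula follows scikit-learn's `smooth_idf` convention):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Character-level term frequencies weighted by a smoothed idf:
    # idf(t) = log((1 + N) / (1 + df(t))) + 1, so identical strings
    # still get non-zero vectors and a cosine of 1.
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c[t] * idf[t] for t in c} for c in counts]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

u, v = tfidf_vectors(["abd", "abe"])
print(cosine(u, v))  # about 0.5: 'ab' is shared, 'd'/'e' differ
```

Any similarity function with this signature would do; TF-IDF is just what I had in mind.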
Thanks
Upvotes: 0
Views: 67
Reputation: 13518
With the dataframes you provided:
import pandas as pd
df1 = pd.DataFrame(
    {"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": [1990, 1990, 1990]}
)
df2 = pd.DataFrame(
    {"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": [1991, 1991, 1991]}
)
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": [1992, 1992]})
Here is one way to do it with the Python standard library's difflib module, which provides tools for comparing sequences (used here as the similarity measure in place of TF-IDF):
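As a quick sanity check before the full solution, this is what `SequenceMatcher.ratio` returns for texts like yours (a standalone snippet, not part of the solution below):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # 2 * (number of matching characters) / (total characters in both strings)
    return SequenceMatcher(None, a, b).ratio()

print(ratio("abd", "abe"))  # 'ab' matches: 2*2/(3+3) ~ 0.667
print(ratio("abc", "abc"))  # identical strings -> 1.0
print(ratio("a", "bc"))     # nothing in common -> 0.0
```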
# Setup
from difflib import SequenceMatcher
def ratio(a, b):
    """Return the similarity of two strings as a float between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()
df = pd.concat([df1, df2, df3])
# Add similarity column with 100% matches and remove duplicates
df["Similarity"] = df.duplicated(subset=["ID", "Text"], keep="first").astype(int)
df = df.drop_duplicates(subset=["ID", "Text"], keep="last")
# Calculate ratio for different Text values
tmp = df.assign(
    match=df["Text"].map(lambda x: {value: ratio(x, value) for value in df["Text"]})
)
# Format and filter results
tmp = (
    pd.DataFrame(tmp["match"].to_list(), index=tmp["Text"])
    .reset_index(drop=False)
    .melt("Text", var_name="Other_Text", value_name="ratio")
    .dropna()
)
tmp = tmp[(tmp["Text"] != tmp["Other_Text"]) & (tmp["ratio"] > 0.5)]  # adjust threshold as needed
# Update similarities
df["Similarity"] = df.apply(
    lambda x: 1 if x["Text"] in tmp["Text"].values else x["Similarity"], axis=1
)
# Add matching values when similarity is not 100%
df = pd.merge(
    left=df,
    right=df.groupby("ID").agg({"Text": list}).reset_index(),
    how="left",
    left_on="ID",
    right_on="ID",
)
df.loc[df["ID"].duplicated(), "Similarity"] = df.loc[df["ID"].duplicated(), "Text_y"]
# Cleanup
df = (
    df.drop_duplicates("ID", keep="last")
    .sort_values("ID", ignore_index=True)
    .drop(columns=["Text_x", "Text_y"])
)
Then:
print(df)
# Output
    ID  Year  Similarity
0  101  1991           1
1  102  1991  [abd, abe]
2  103  1990           0
3  104  1992           1
4  105  1992           0
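As a side note, the concat-and-keep-latest part of the logic can be sketched on its own, without the similarity bookkeeping. Note this always keeps the most recent row per ID, whereas the solution above only replaces the text after comparing it (the `latest` name is just illustrative):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": [1990, 1990, 1990]}
)
df2 = pd.DataFrame(
    {"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": [1991, 1991, 1991]}
)
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": [1992, 1992]})

# Outer-style union of all three frames, then the most recent row per ID
latest = (
    pd.concat([df1, df2, df3])
    .sort_values("Year")
    .groupby("ID", as_index=False)
    .last()
)
print(latest)
```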
Upvotes: 2