stephsmith

Reputation: 181

How to outer merge 3 or more datasets based on an id and calculate the similarity between them using another column?

Let's say we have three datasets, one for each of three different years:

ID Text Year
101 abc 1990
102 abd 1990
103 a 1990

And a second dataset, which may or may not contain IDs from the first year:

ID Text Year
104 bc 1991
101 abc 1991
102 abe 1991

And the third dataset:

ID Text Year
104 bc 1992
105 a 1992

I want to merge these three dataframes and add a new column holding the text similarity (using TF-IDF) between common IDs (and uncommon IDs) across those consecutive years, and also update the year and the text whenever the similarity between the text from Year1 and the text from Year2 is above 0.8.

Here is the result I want:

ID Text Year Similarity
101 abc 1991 1
102 abe 1991 TF-IDF('abd', 'abe')
103 a 1990 0
104 bc 1992 1
105 a 1992 0

So I want to include the new IDs from the later years, but also keep the IDs of a previous year that have no ID match in a later one, together with that similarity column. The merge should not be inner (because we also want to keep IDs that are not present in the second/third dataframe), and the year should be updated when the similarity score is above a threshold (say, if the text of ID 104 in year 1991 has > 0.8 similarity with the text of ID 104 in year 1992).
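To illustrate the merge requirement alone: an outer merge keeps IDs present in either dataframe, whereas an inner merge would drop the unmatched ones. A minimal sketch with the first two frames from the example (the `suffixes` names are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": [1990, 1990, 1990]}
)
df2 = pd.DataFrame(
    {"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": [1991, 1991, 1991]}
)

# how="outer" keeps IDs that occur in either frame; an inner merge
# would drop 103 (only in 1990) and 104 (only in 1991).
merged = pd.merge(df1, df2, on="ID", how="outer", suffixes=("_1990", "_1991"))
print(sorted(merged["ID"]))  # [101, 102, 103, 104]
```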

Thanks

Upvotes: 0

Views: 67

Answers (1)

Laurent

Reputation: 13518

With the dataframes you provided:

import pandas as pd

df1 = pd.DataFrame(
    {"ID": [101, 102, 103], "Text": ["abc", "abd", "a"], "Year": [1990, 1990, 1990]}
)
df2 = pd.DataFrame(
    {"ID": [104, 101, 102], "Text": ["bc", "abc", "abe"], "Year": [1991, 1991, 1991]}
)
df3 = pd.DataFrame({"ID": [104, 105], "Text": ["bc", "a"], "Year": [1992, 1992]})

Here is one way to do it with the Python standard library's difflib module, which provides functions for comparing sequences:

# Setup
from difflib import SequenceMatcher

def ratio(a, b):
    """Return the difflib similarity ratio (0.0-1.0) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

df = pd.concat([df1, df2, df3])
# Add similarity column with 100% matches and remove duplicates
df["Similarity"] = df.duplicated(subset=["ID", "Text"], keep="first").astype(int)
df = df.drop_duplicates(subset=["ID", "Text"], keep="last")

# Calculate ratio for different Text values
tmp = df.assign(
    match=df["Text"].map(lambda x: {value: ratio(x, value) for value in df["Text"]})
)

# Format and filter results
tmp = (
    pd.DataFrame(tmp["match"].to_list(), index=tmp["Text"])
    .reset_index(drop=False)
    .melt("Text", var_name="Other_Text", value_name="ratio")
    .dropna()
)
# Keep pairs of different texts above the similarity threshold
# (0.5 here; the question asked for 0.8)
tmp = tmp[(tmp["Text"] != tmp["Other_Text"]) & (tmp["ratio"] > 0.5)]

# Update similarities
df["Similarity"] = df.apply(
    lambda x: 1 if x["Text"] in tmp["Text"].values else x["Similarity"], axis=1
)

# Add matching values when similarity is not 100%
df = pd.merge(
    left=df,
    right=df.groupby("ID").agg({"Text": list}).reset_index(),
    how="left",
    left_on="ID",
    right_on="ID",
)
df.loc[df["ID"].duplicated(), "Similarity"] = df.loc[df["ID"].duplicated(), "Text_y"]

# Cleanup
df = (
    df.drop_duplicates("ID", keep="last")
    .sort_values("ID", ignore_index=True)
    .drop(columns=["Text_x", "Text_y"])
)
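As a sanity check on the similarity values involved, the ratio for the text pairs in the example can be computed directly with the standard library:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Return the difflib similarity ratio (0.0-1.0) between two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Identical texts score 1.0; 'abd' vs 'abe' share 2 of 3 characters,
# so the ratio is 2 * 2 / (3 + 3).
print(ratio("abc", "abc"))            # 1.0
print(round(ratio("abd", "abe"), 3))  # 0.667
```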

Then:

print(df)
# Output
    ID  Year  Similarity
0  101  1991           1
1  102  1991  [abd, abe]
2  103  1990           0
3  104  1992           1
4  105  1992           0
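Since the question specifically asked for TF-IDF rather than difflib's ratio, here is a minimal, hedged sketch of a character-level TF-IDF cosine similarity using only the standard library (in practice you would likely use scikit-learn's TfidfVectorizer instead; the smoothed-IDF formula below mirrors its default so that characters shared by both texts are not zeroed out when only two documents are compared):

```python
import math
from collections import Counter

def tfidf_cosine(a: str, b: str) -> float:
    """Character-level TF-IDF cosine similarity between two strings."""
    docs = [Counter(a), Counter(b)]
    vocab = set(a) | set(b)
    # Smoothed IDF (scikit-learn style): ln((1 + N) / (1 + df)) + 1
    idf = {
        ch: math.log((1 + len(docs)) / (1 + sum(ch in d for d in docs))) + 1
        for ch in vocab
    }
    vecs = [{ch: d[ch] * idf[ch] for ch in vocab} for d in docs]
    dot = sum(vecs[0][ch] * vecs[1][ch] for ch in vocab)
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    return dot / (norms[0] * norms[1]) if all(norms) else 0.0

print(round(tfidf_cosine("abc", "abc"), 6))  # 1.0
print(tfidf_cosine("abd", "abe"))            # strictly between 0 and 1
```

This could replace the `ratio` helper in the answer above if a TF-IDF-based score is required.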

Upvotes: 2
