myamulla_ciencia
myamulla_ciencia

Reputation: 1498

How to get fuzzy matches of given set of names in python polars dataframe?

I'm trying to implement a name duplications for one of our use case.

Here I have a set of 10 names along with their index column as below.

df = pl.from_repr("""
┌───────┬───────────────────┐
│ index ┆ full_name         │
│ ---   ┆ ---               │
│ u32   ┆ str               │
╞═══════╪═══════════════════╡
│ 0     ┆ Mallesham Yamulla │
│ 1     ┆ Velmala Sharath   │
│ 2     ┆ Jagarini Yegurla  │
│ 3     ┆ Sharath Velmala   │
│ 4     ┆ Bhavik Vemulla    │
│ 5     ┆ Yegurla Mahesh    │
│ 6     ┆ Yegurla Jagarini  │
│ 7     ┆ Vermula Bhavik    │
│ 8     ┆ Mahesh Yegurla    │
│ 9     ┆ Yamulla Mallesham │
└───────┴───────────────────┘
""")

Here I would like to calculate fuzzy metrics(Levenshtein,JaroWinkler) per each of name combinations using a rapidxfuzz module as below.

from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein,JaroWinkler
round(Levenshtein.normalized_similarity(name_0,name_1),5)
round(JaroWinkler.similarity(name_0,name_1),5)

For example: idx-0 name Mallesham Yamulla to be paired with names having indexes sequence (1,9) names[(0,1),(0,2),(0,3),(0,4),(0,5),(0,6),(0,7),(0,8),(0,9)] and calculate their levenshtein and Jarowrinkler similar percentages.

Next idx-1 name with names index sequence (2,9), idx-2 with name index sequence (3,9), idx-3 with (4,9) so on so forth till (8,9)

The expected output would be :

enter image description here

Upvotes: 1

Views: 1781

Answers (3)

jqurious
jqurious

Reputation: 21724

The author is working on a Rust version: https://github.com/rapidfuzz/rapidfuzz-rs

It can be used in Polars via the polars-ds extension plugin.

import polars as pl
import polars_ds as pds

.join_where() can be used to generate all the "combinations" to be compared.

lf = df.lazy()

lf.join_where(lf, 
   pl.col.index < pl.col.index_right,
).with_columns(
   pds.str_leven("full_name", "full_name_right").alias("leven"),
   pds.str_jw("full_name", "full_name_right").alias("jw"),
   pds.str_fuzz("full_name", "full_name_right").alias("fuzz")
).collect()

Output

shape: (45, 7)
┌───────┬───────────────────┬─────────────┬───────────────────┬───────┬──────────┬──────────┐
│ index ┆ full_name         ┆ index_right ┆ full_name_right   ┆ leven ┆ jw       ┆ fuzz     │
│ ---   ┆ ---               ┆ ---         ┆ ---               ┆ ---   ┆ ---      ┆ ---      │
│ u32   ┆ str               ┆ u32         ┆ str               ┆ u32   ┆ f64      ┆ f64      │
╞═══════╪═══════════════════╪═════════════╪═══════════════════╪═══════╪══════════╪══════════╡
│ 0     ┆ Mallesham Yamulla ┆ 1           ┆ Velmala Sharath   ┆ 14    ┆ 0.651634 ┆ 0.375    │
│ 0     ┆ Mallesham Yamulla ┆ 2           ┆ Jagarini Yegurla  ┆ 11    ┆ 0.531863 ┆ 0.424242 │
│ 0     ┆ Mallesham Yamulla ┆ 3           ┆ Sharath Velmala   ┆ 13    ┆ 0.651634 ┆ 0.375    │
│ 1     ┆ Velmala Sharath   ┆ 2           ┆ Jagarini Yegurla  ┆ 15    ┆ 0.481944 ┆ 0.322581 │
│ 1     ┆ Velmala Sharath   ┆ 3           ┆ Sharath Velmala   ┆ 12    ┆ 0.596825 ┆ 0.466667 │
│ …     ┆ …                 ┆ …           ┆ …                 ┆ …     ┆ …        ┆ …        │
│ 6     ┆ Yegurla Jagarini  ┆ 8           ┆ Mahesh Yegurla    ┆ 13    ┆ 0.624008 ┆ 0.466667 │
│ 6     ┆ Yegurla Jagarini  ┆ 9           ┆ Yamulla Mallesham ┆ 11    ┆ 0.521183 ┆ 0.424242 │
│ 7     ┆ Vermula Bhavik    ┆ 8           ┆ Mahesh Yegurla    ┆ 14    ┆ 0.507937 ┆ 0.285714 │
│ 7     ┆ Vermula Bhavik    ┆ 9           ┆ Yamulla Mallesham ┆ 13    ┆ 0.494398 ┆ 0.451613 │
│ 8     ┆ Mahesh Yegurla    ┆ 9           ┆ Yamulla Mallesham ┆ 14    ┆ 0.494398 ┆ 0.387097 │
└───────┴───────────────────┴─────────────┴───────────────────┴───────┴──────────┴──────────┘

Upvotes: 1

ghuls
ghuls

Reputation: 486

# Change 3 to 1000 or 10000 to split up the cross join part in multiple iterations with a smaller dataframe, which you can run the levenshtine/jarowinkler functions on. That function output you probably should filter to remove rows for which the values are too low.
for x in range(0, df.height, 3):
    df_combinations_x = df.join(
        df.slice(offset=x, length=3),
        how="cross",
        suffix="_2",
    ).filter(
        pl.col("index") != pl.col("index_2")
    )
    print(df_combinations_x)
shape: (9, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ ---   ┆ ---       ┆ ---     ┆ ---         │
│ u32   ┆ str       ┆ u32     ┆ str         │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0     ┆ Aaaa aaaa ┆ 1       ┆ Baaa abba   │
│ 0     ┆ Aaaa aaaa ┆ 2       ┆ Acac acca   │
│ 1     ┆ Baaa abba ┆ 0       ┆ Aaaa aaaa   │
│ 1     ┆ Baaa abba ┆ 2       ┆ Acac acca   │
│ 2     ┆ Acac acca ┆ 0       ┆ Aaaa aaaa   │
│ 2     ┆ Acac acca ┆ 1       ┆ Baaa abba   │
│ 3     ┆ Dada dddd ┆ 0       ┆ Aaaa aaaa   │
│ 3     ┆ Dada dddd ┆ 1       ┆ Baaa abba   │
│ 3     ┆ Dada dddd ┆ 2       ┆ Acac acca   │
└───────┴───────────┴─────────┴─────────────┘
shape: (3, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ ---   ┆ ---       ┆ ---     ┆ ---         │
│ u32   ┆ str       ┆ u32     ┆ str         │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0     ┆ Aaaa aaaa ┆ 3       ┆ Dada dddd   │
│ 1     ┆ Baaa abba ┆ 3       ┆ Dada dddd   │
│ 2     ┆ Acac acca ┆ 3       ┆ Dada dddd   │
└───────┴───────────┴─────────┴─────────────┘

Upvotes: 1

ghuls
ghuls

Reputation: 486

Create example dataframe.

df = pl.DataFrame(
    pl.Series("full_name", ["Aaaa aaaa", "Baaa abba", "Acac acca", "Dada dddd"])
).with_row_index()
shape: (4, 2)
┌───────┬───────────┐
│ index ┆ full_name │
│ ---   ┆ ---       │
│ u32   ┆ str       │
╞═══════╪═══════════╡
│ 0     ┆ Aaaa aaaa │
│ 1     ┆ Baaa abba │
│ 2     ┆ Acac acca │
│ 3     ┆ Dada dddd │
└───────┴───────────┘

Join dataframe with itself in a cross join and remove rows where index == index

df_combinations = df.join(
    df,
    how="cross",
    suffix="_2",
).filter(
    pl.col("index") != pl.col("index_2")
)
shape: (12, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ ---   ┆ ---       ┆ ---     ┆ ---         │
│ u32   ┆ str       ┆ u32     ┆ str         │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0     ┆ Aaaa aaaa ┆ 1       ┆ Baaa abba   │
│ 0     ┆ Aaaa aaaa ┆ 2       ┆ Acac acca   │
│ 0     ┆ Aaaa aaaa ┆ 3       ┆ Dada dddd   │
│ 1     ┆ Baaa abba ┆ 0       ┆ Aaaa aaaa   │
│ 1     ┆ Baaa abba ┆ 2       ┆ Acac acca   │
│ …     ┆ …         ┆ …       ┆ …           │
│ 2     ┆ Acac acca ┆ 1       ┆ Baaa abba   │
│ 2     ┆ Acac acca ┆ 3       ┆ Dada dddd   │
│ 3     ┆ Dada dddd ┆ 0       ┆ Aaaa aaaa   │
│ 3     ┆ Dada dddd ┆ 1       ┆ Baaa abba   │
│ 3     ┆ Dada dddd ┆ 2       ┆ Acac acca   │
└───────┴───────────┴─────────┴─────────────┘

Run rapidfuzz using map_elements

df_combinations.with_columns(
    # Combine "index" and "index_2" columns to one struct column.
    pl.struct("index", "index_2").alias("index_comb"),
    # Combine "full_name" and "full_name_2" columns to one struct column.
    pl.struct("full_name", "full_name_2").alias("full_name_comb"),
).with_columns(
    # Run custom functions on struct column.
    pl.col("full_name_comb").map_elements(lambda t: Levenshtein.normalized_similarity(t["full_name"], t["full_name_2"])).alias("levenshtein"),
    pl.col("full_name_comb").map_elements(lambda t: JaroWinkler.similarity(t["full_name"], t["full_name_2"])).alias("jarowinkler"),
)
shape: (12, 8)
┌───────┬───────────┬─────────┬─────────────┬────────────┬───────────────────────────┬─────────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 ┆ index_comb ┆ full_name_comb            ┆ levenshtein ┆ jarowinkler │
│ ---   ┆ ---       ┆ ---     ┆ ---         ┆ ---        ┆ ---                       ┆ ---         ┆ ---         │
│ u32   ┆ str       ┆ u32     ┆ str         ┆ struct[2]  ┆ struct[2]                 ┆ f64         ┆ f64         │
╞═══════╪═══════════╪═════════╪═════════════╪════════════╪═══════════════════════════╪═════════════╪═════════════╡
│ 0     ┆ Aaaa aaaa ┆ 1       ┆ Baaa abba   ┆ {0,1}      ┆ {"Aaaa aaaa","Baaa abba"} ┆ 0.666667    ┆ 0.777778    │
│ 0     ┆ Aaaa aaaa ┆ 2       ┆ Acac acca   ┆ {0,2}      ┆ {"Aaaa aaaa","Acac acca"} ┆ 0.555556    ┆ 0.637037    │
│ 0     ┆ Aaaa aaaa ┆ 3       ┆ Dada dddd   ┆ {0,3}      ┆ {"Aaaa aaaa","Dada dddd"} ┆ 0.333333    ┆ 0.555556    │
│ 1     ┆ Baaa abba ┆ 0       ┆ Aaaa aaaa   ┆ {1,0}      ┆ {"Baaa abba","Aaaa aaaa"} ┆ 0.666667    ┆ 0.777778    │
│ 1     ┆ Baaa abba ┆ 2       ┆ Acac acca   ┆ {1,2}      ┆ {"Baaa abba","Acac acca"} ┆ 0.444444    ┆ 0.546296    │
│ …     ┆ …         ┆ …       ┆ …           ┆ …          ┆ …                         ┆ …           ┆ …           │
│ 2     ┆ Acac acca ┆ 1       ┆ Baaa abba   ┆ {2,1}      ┆ {"Acac acca","Baaa abba"} ┆ 0.444444    ┆ 0.546296    │
│ 2     ┆ Acac acca ┆ 3       ┆ Dada dddd   ┆ {2,3}      ┆ {"Acac acca","Dada dddd"} ┆ 0.111111    ┆ 0.444444    │
│ 3     ┆ Dada dddd ┆ 0       ┆ Aaaa aaaa   ┆ {3,0}      ┆ {"Dada dddd","Aaaa aaaa"} ┆ 0.333333    ┆ 0.555556    │
│ 3     ┆ Dada dddd ┆ 1       ┆ Baaa abba   ┆ {3,1}      ┆ {"Dada dddd","Baaa abba"} ┆ 0.333333    ┆ 0.555556    │
│ 3     ┆ Dada dddd ┆ 2       ┆ Acac acca   ┆ {3,2}      ┆ {"Dada dddd","Acac acca"} ┆ 0.111111    ┆ 0.444444    │
└───────┴───────────┴─────────┴─────────────┴────────────┴───────────────────────────┴─────────────┴─────────────┘

Upvotes: 2

Related Questions