Reputation: 1498
I'm trying to implement a name duplications for one of our use case.
Here I have a set of 10 names along with their index column as below.
df = pl.from_repr("""
┌───────┬───────────────────┐
│ index ┆ full_name │
│ --- ┆ --- │
│ u32 ┆ str │
╞═══════╪═══════════════════╡
│ 0 ┆ Mallesham Yamulla │
│ 1 ┆ Velmala Sharath │
│ 2 ┆ Jagarini Yegurla │
│ 3 ┆ Sharath Velmala │
│ 4 ┆ Bhavik Vemulla │
│ 5 ┆ Yegurla Mahesh │
│ 6 ┆ Yegurla Jagarini │
│ 7 ┆ Vermula Bhavik │
│ 8 ┆ Mahesh Yegurla │
│ 9 ┆ Yamulla Mallesham │
└───────┴───────────────────┘
""")
Here I would like to calculate fuzzy metrics(Levenshtein,JaroWinkler) per each of name combinations using a rapidxfuzz module as below.
from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein,JaroWinkler
round(Levenshtein.normalized_similarity(name_0,name_1),5)
round(JaroWinkler.similarity(name_0,name_1),5)
For example: idx-0 name Mallesham Yamulla to be paired with names having indexes sequence (1,9) names[(0,1),(0,2),(0,3),(0,4),(0,5),(0,6),(0,7),(0,8),(0,9)] and calculate their levenshtein and Jarowrinkler similar percentages.
Next idx-1 name with names index sequence (2,9), idx-2 with name index sequence (3,9), idx-3 with (4,9) so on so forth till (8,9)
The expected output would be :
Upvotes: 1
Views: 1781
Reputation: 21724
The rapidfuzz author is working on a Rust version: https://github.com/rapidfuzz/rapidfuzz-rs
It can be used in Polars via the polars-ds extension plugin.
import polars as pl
import polars_ds as pds
.join_where()
can be used to generate all the "combinations" to be compared.
lf = df.lazy()
lf.join_where(lf,
pl.col.index < pl.col.index_right,
).with_columns(
pds.str_leven("full_name", "full_name_right").alias("leven"),
pds.str_jw("full_name", "full_name_right").alias("jw"),
pds.str_fuzz("full_name", "full_name_right").alias("fuzz")
).collect()
Output
shape: (45, 7)
┌───────┬───────────────────┬─────────────┬───────────────────┬───────┬──────────┬──────────┐
│ index ┆ full_name ┆ index_right ┆ full_name_right ┆ leven ┆ jw ┆ fuzz │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str ┆ u32 ┆ f64 ┆ f64 │
╞═══════╪═══════════════════╪═════════════╪═══════════════════╪═══════╪══════════╪══════════╡
│ 0 ┆ Mallesham Yamulla ┆ 1 ┆ Velmala Sharath ┆ 14 ┆ 0.651634 ┆ 0.375 │
│ 0 ┆ Mallesham Yamulla ┆ 2 ┆ Jagarini Yegurla ┆ 11 ┆ 0.531863 ┆ 0.424242 │
│ 0 ┆ Mallesham Yamulla ┆ 3 ┆ Sharath Velmala ┆ 13 ┆ 0.651634 ┆ 0.375 │
│ 1 ┆ Velmala Sharath ┆ 2 ┆ Jagarini Yegurla ┆ 15 ┆ 0.481944 ┆ 0.322581 │
│ 1 ┆ Velmala Sharath ┆ 3 ┆ Sharath Velmala ┆ 12 ┆ 0.596825 ┆ 0.466667 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 6 ┆ Yegurla Jagarini ┆ 8 ┆ Mahesh Yegurla ┆ 13 ┆ 0.624008 ┆ 0.466667 │
│ 6 ┆ Yegurla Jagarini ┆ 9 ┆ Yamulla Mallesham ┆ 11 ┆ 0.521183 ┆ 0.424242 │
│ 7 ┆ Vermula Bhavik ┆ 8 ┆ Mahesh Yegurla ┆ 14 ┆ 0.507937 ┆ 0.285714 │
│ 7 ┆ Vermula Bhavik ┆ 9 ┆ Yamulla Mallesham ┆ 13 ┆ 0.494398 ┆ 0.451613 │
│ 8 ┆ Mahesh Yegurla ┆ 9 ┆ Yamulla Mallesham ┆ 14 ┆ 0.494398 ┆ 0.387097 │
└───────┴───────────────────┴─────────────┴───────────────────┴───────┴──────────┴──────────┘
Upvotes: 1
Reputation: 486
# Change 3 to 1000 or 10000 to split up the cross join part in multiple iterations with a smaller dataframe, which you can run the levenshtine/jarowinkler functions on. That function output you probably should filter to remove rows for which the values are too low.
for x in range(0, df.height, 3):
df_combinations_x = df.join(
df.slice(offset=x, length=3),
how="cross",
suffix="_2",
).filter(
pl.col("index") != pl.col("index_2")
)
print(df_combinations_x)
shape: (9, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0 ┆ Aaaa aaaa ┆ 1 ┆ Baaa abba │
│ 0 ┆ Aaaa aaaa ┆ 2 ┆ Acac acca │
│ 1 ┆ Baaa abba ┆ 0 ┆ Aaaa aaaa │
│ 1 ┆ Baaa abba ┆ 2 ┆ Acac acca │
│ 2 ┆ Acac acca ┆ 0 ┆ Aaaa aaaa │
│ 2 ┆ Acac acca ┆ 1 ┆ Baaa abba │
│ 3 ┆ Dada dddd ┆ 0 ┆ Aaaa aaaa │
│ 3 ┆ Dada dddd ┆ 1 ┆ Baaa abba │
│ 3 ┆ Dada dddd ┆ 2 ┆ Acac acca │
└───────┴───────────┴─────────┴─────────────┘
shape: (3, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0 ┆ Aaaa aaaa ┆ 3 ┆ Dada dddd │
│ 1 ┆ Baaa abba ┆ 3 ┆ Dada dddd │
│ 2 ┆ Acac acca ┆ 3 ┆ Dada dddd │
└───────┴───────────┴─────────┴─────────────┘
Upvotes: 1
Reputation: 486
Create example dataframe.
df = pl.DataFrame(
pl.Series("full_name", ["Aaaa aaaa", "Baaa abba", "Acac acca", "Dada dddd"])
).with_row_index()
shape: (4, 2)
┌───────┬───────────┐
│ index ┆ full_name │
│ --- ┆ --- │
│ u32 ┆ str │
╞═══════╪═══════════╡
│ 0 ┆ Aaaa aaaa │
│ 1 ┆ Baaa abba │
│ 2 ┆ Acac acca │
│ 3 ┆ Dada dddd │
└───────┴───────────┘
Join dataframe with itself in a cross join and remove rows where index == index
df_combinations = df.join(
df,
how="cross",
suffix="_2",
).filter(
pl.col("index") != pl.col("index_2")
)
shape: (12, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0 ┆ Aaaa aaaa ┆ 1 ┆ Baaa abba │
│ 0 ┆ Aaaa aaaa ┆ 2 ┆ Acac acca │
│ 0 ┆ Aaaa aaaa ┆ 3 ┆ Dada dddd │
│ 1 ┆ Baaa abba ┆ 0 ┆ Aaaa aaaa │
│ 1 ┆ Baaa abba ┆ 2 ┆ Acac acca │
│ … ┆ … ┆ … ┆ … │
│ 2 ┆ Acac acca ┆ 1 ┆ Baaa abba │
│ 2 ┆ Acac acca ┆ 3 ┆ Dada dddd │
│ 3 ┆ Dada dddd ┆ 0 ┆ Aaaa aaaa │
│ 3 ┆ Dada dddd ┆ 1 ┆ Baaa abba │
│ 3 ┆ Dada dddd ┆ 2 ┆ Acac acca │
└───────┴───────────┴─────────┴─────────────┘
Run rapidfuzz using map_elements
df_combinations.with_columns(
# Combine "index" and "index_2" columns to one struct column.
pl.struct("index", "index_2").alias("index_comb"),
# Combine "full_name" and "full_name_2" columns to one struct column.
pl.struct("full_name", "full_name_2").alias("full_name_comb"),
).with_columns(
# Run custom functions on struct column.
pl.col("full_name_comb").map_elements(lambda t: Levenshtein.normalized_similarity(t["full_name"], t["full_name_2"])).alias("levenshtein"),
pl.col("full_name_comb").map_elements(lambda t: JaroWinkler.similarity(t["full_name"], t["full_name_2"])).alias("jarowinkler"),
)
shape: (12, 8)
┌───────┬───────────┬─────────┬─────────────┬────────────┬───────────────────────────┬─────────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 ┆ index_comb ┆ full_name_comb ┆ levenshtein ┆ jarowinkler │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str ┆ struct[2] ┆ struct[2] ┆ f64 ┆ f64 │
╞═══════╪═══════════╪═════════╪═════════════╪════════════╪═══════════════════════════╪═════════════╪═════════════╡
│ 0 ┆ Aaaa aaaa ┆ 1 ┆ Baaa abba ┆ {0,1} ┆ {"Aaaa aaaa","Baaa abba"} ┆ 0.666667 ┆ 0.777778 │
│ 0 ┆ Aaaa aaaa ┆ 2 ┆ Acac acca ┆ {0,2} ┆ {"Aaaa aaaa","Acac acca"} ┆ 0.555556 ┆ 0.637037 │
│ 0 ┆ Aaaa aaaa ┆ 3 ┆ Dada dddd ┆ {0,3} ┆ {"Aaaa aaaa","Dada dddd"} ┆ 0.333333 ┆ 0.555556 │
│ 1 ┆ Baaa abba ┆ 0 ┆ Aaaa aaaa ┆ {1,0} ┆ {"Baaa abba","Aaaa aaaa"} ┆ 0.666667 ┆ 0.777778 │
│ 1 ┆ Baaa abba ┆ 2 ┆ Acac acca ┆ {1,2} ┆ {"Baaa abba","Acac acca"} ┆ 0.444444 ┆ 0.546296 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 2 ┆ Acac acca ┆ 1 ┆ Baaa abba ┆ {2,1} ┆ {"Acac acca","Baaa abba"} ┆ 0.444444 ┆ 0.546296 │
│ 2 ┆ Acac acca ┆ 3 ┆ Dada dddd ┆ {2,3} ┆ {"Acac acca","Dada dddd"} ┆ 0.111111 ┆ 0.444444 │
│ 3 ┆ Dada dddd ┆ 0 ┆ Aaaa aaaa ┆ {3,0} ┆ {"Dada dddd","Aaaa aaaa"} ┆ 0.333333 ┆ 0.555556 │
│ 3 ┆ Dada dddd ┆ 1 ┆ Baaa abba ┆ {3,1} ┆ {"Dada dddd","Baaa abba"} ┆ 0.333333 ┆ 0.555556 │
│ 3 ┆ Dada dddd ┆ 2 ┆ Acac acca ┆ {3,2} ┆ {"Dada dddd","Acac acca"} ┆ 0.111111 ┆ 0.444444 │
└───────┴───────────┴─────────┴─────────────┴────────────┴───────────────────────────┴─────────────┴─────────────┘
Upvotes: 2