John Constantine
John Constantine

Reputation: 1092

Python string matching with Spark dataframe

I have a spark dataframe

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
1   |  CA   | appel | 100
2   |  CA   | orange| 20
3   |  CA   | berry | 10

I want to get rows where fruits are apple or orange. So I use Spark SQL:

SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';

It returns

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
2   |  CA   | orange| 20

But it is supposed to return

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
1   |  CA   | appel | 100
2   |  CA   | orange| 20

as row 1 is just a misspelling.

So I plan on using fuzzywuzzy for string matching. I know that

import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.partial_ratio('apple', 'apple')) -> 100
print(fuzz.partial_ratio('apple', 'appel')) -> 83

But am not sure how to apply this to a column in dataframe to get relevant rows

Upvotes: 0

Views: 1791

Answers (1)

ggordon
ggordon

Reputation: 10035

Since you are interested in implementing fuzzy matching as a filter, you must first decide on a threshold of how similar you would like the matches.

Approach 1

For your fuzzywuzzy import this could be 80 for the purpose of this demonstration (adjust based on your needs). You could then implement a udf to apply your imported fuzzy logic code eg

from pyspark.sql import functions as F
from pyspark.sql import types as T

F.udf(T.BooleanType())
def is_fuzzy_match(field_value,search_value, threshold=80):
    from fuzzywuzzy import fuzz
    return fuzz.partial_ratio(field_value, search_value) > threshold

Then apply your udf as a filter on your dataframe

df = (
    df.where(
          is_fuzzy_match(F.col("fruit"),F.lit("apple"))  | 
          is_fuzzy_match(F.col("fruit"),F.lit("orange"))  
    )
)

Approach 2 : Recommended

However, udfs can be expensive when executed on spark and spark has already implemented the levenshtein function which is also useful here. You may start reading more about how the levenshtein distance accomplishes fuzzy matching .

With this approach your code code look like using a threshold of 3 below

from pyspark.sql import functions as F

df = df.where(
    (
        F.levenshtein(
            F.col("fruit"),
            F.lit("apple")
        ) < 3
     ) |
     (
        F.levenshtein(
            F.col("fruit"),
            F.lit("orange")
        ) < 3
     ) 
)

df.show()
+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
|  0|  CA| apple|     300|
|  1|  CA| appel|     100|
|  2|  CA|orange|      20|
+---+----+------+--------+

For debugging purposes the result of the levenshtein has been included below

df.withColumn("diff",
    F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).show()
+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
|  0|  CA| apple|     300|   0|
|  1|  CA| appel|     100|   2|
|  2|  CA|orange|      20|   5|
|  3|  CA| berry|      10|   5|
+---+----+------+--------+----+

Update 1

In response to additional sample data provided by Op in comments:

If I have a fruit like kashmir apple and want it to match with apple

Approach 3

You could try the following approach and adjust the threshold as desired.

Since you are interested in matching the possibility of a misspelled fruit throughout the text, you could attempt to apply the levenshtein to every piece throughout the fruit name. The functions below (not udfs but for readability simplifies the application of the task ) implement this approach. matches_fruit_ratio attempts to determine how much of a match is found while matches_fruit determines the maximum matches_fruit_ratio on every piece of the fruit name split by a space.


from pyspark.sql import functions as F

def matches_fruit_ratio(fruit_column,fruit_search,threshold=0.3):
    return (F.length(fruit_column) - F.levenshtein(
        fruit_column,
        F.lit(fruit_search)
    )) / F.length(fruit_column) 

def matches_fruit(fruit_column,fruit_search,threshold=0.6):
    return F.array_max(F.transform(
        F.split(fruit_column," "),
        lambda fruit_piece : matches_fruit_ratio(fruit_piece,fruit_search)
    )) >= threshold

This can be used as follows:

df = df.where(
    
    matches_fruit(
        F.col("fruit"),
        "apple"
    ) | matches_fruit(
        F.col("fruit"),
        "orange"
    )
)
df.show()
+---+----+-------------+--------+
| id|city|        fruit|quantity|
+---+----+-------------+--------+
|  0|  CA|        apple|     300|
|  1|  CA|        appel|     100|
|  2|  CA|       orange|      20|
|  4|  CA|  apply berry|       3|
|  5|  CA|  apple berry|       1|
|  6|  CA|kashmir apple|       5|
|  7|  CA|kashmir appel|       8|
+---+----+-------------+--------+

For debugging purposes, I have added additional sample data and output columns for the different components of each function while demonstrating how this function may be used

df.withColumn("length",
    F.length(
        "fruit"
    )
).withColumn("levenshtein",
    F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).withColumn("length - levenshtein",
    F.length(
        "fruit"
    ) - F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).withColumn(
    "matches_fruit_ratio",
    matches_fruit_ratio(
        F.col("fruit"),
        "apple"
    )
).withColumn(
    "matches_fruit_values_before_threshold",
    F.array_max(F.transform(
        F.split("fruit"," "),
        lambda fruit_piece : matches_fruit_ratio(fruit_piece,"apple")
    ))
).withColumn(
    "matches_fruit",
    matches_fruit(
        F.col("fruit"),
        "apple"
    )
).show()
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| id|city|        fruit|quantity|length|levenshtein|length - levenshtein|matches_fruit_ratio|matches_fruit_values_before_threshold|matches_fruit|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
|  0|  CA|        apple|     300|     5|          0|                   5|                1.0|                                  1.0|         true|
|  1|  CA|        appel|     100|     5|          2|                   3|                0.6|                                  0.6|         true|
|  2|  CA|       orange|      20|     6|          5|                   1|0.16666666666666666|                  0.16666666666666666|        false|
|  3|  CA|        berry|      10|     5|          5|                   0|                0.0|                                  0.0|        false|
|  4|  CA|  apply berry|       3|    11|          6|                   5|0.45454545454545453|                                  0.8|         true|
|  5|  CA|  apple berry|       1|    11|          6|                   5|0.45454545454545453|                                  1.0|         true|
|  6|  CA|kashmir apple|       5|    13|          8|                   5|0.38461538461538464|                                  1.0|         true|
|  7|  CA|kashmir appel|       8|    13|         10|                   3|0.23076923076923078|                                  0.6|         true|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+

Upvotes: 2

Related Questions