Reputation: 1092
I have a spark dataframe
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
3 | CA | berry | 10
I want to get rows where fruits are apple
or orange
. So I use Spark SQL:
SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';
It returns
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
2 | CA | orange| 20
But it is supposed to return
id | city | fruit | quantity
-------------------------
0 | CA | apple | 300
1 | CA | appel | 100
2 | CA | orange| 20
as row 1 is just a misspelling.
So I plan on using fuzzywuzzy
for string matching.
I know that
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
print(fuzz.partial_ratio('apple', 'apple')) -> 100
print(fuzz.partial_ratio('apple', 'appel')) -> 83
But am not sure how to apply this to a column in dataframe to get relevant rows
Upvotes: 0
Views: 1791
Reputation: 10035
Since you are interested in implementing fuzzy matching as a filter, you must first decide on a threshold of how similar you would like the matches.
For your fuzzywuzzy
import this could be 80
for the purpose of this demonstration (adjust based on your needs). You could then implement a udf to apply your imported fuzzy logic code eg
from pyspark.sql import functions as F
from pyspark.sql import types as T
F.udf(T.BooleanType())
def is_fuzzy_match(field_value,search_value, threshold=80):
from fuzzywuzzy import fuzz
return fuzz.partial_ratio(field_value, search_value) > threshold
Then apply your udf as a filter on your dataframe
df = (
df.where(
is_fuzzy_match(F.col("fruit"),F.lit("apple")) |
is_fuzzy_match(F.col("fruit"),F.lit("orange"))
)
)
However, udfs can be expensive when executed on spark and spark has already implemented the levenshtein
function which is also useful here. You may start reading more about how the levenshtein distance accomplishes fuzzy matching .
With this approach your code code look like using a threshold of 3
below
from pyspark.sql import functions as F
df = df.where(
(
F.levenshtein(
F.col("fruit"),
F.lit("apple")
) < 3
) |
(
F.levenshtein(
F.col("fruit"),
F.lit("orange")
) < 3
)
)
df.show()
+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA|orange| 20|
+---+----+------+--------+
For debugging purposes the result of the levenshtein has been included below
df.withColumn("diff",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).show()
+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
| 0| CA| apple| 300| 0|
| 1| CA| appel| 100| 2|
| 2| CA|orange| 20| 5|
| 3| CA| berry| 10| 5|
+---+----+------+--------+----+
In response to additional sample data provided by Op in comments:
If I have a fruit like kashmir apple and want it to match with apple
You could try the following approach and adjust the threshold as desired.
Since you are interested in matching the possibility of a misspelled fruit throughout the text, you could attempt to apply the levenshtein to every piece throughout the fruit name. The functions below (not udfs but for readability simplifies the application of the task ) implement this approach. matches_fruit_ratio
attempts to determine how much of a match is found while matches_fruit
determines the maximum matches_fruit_ratio
on every piece of the fruit name split by a space.
from pyspark.sql import functions as F
def matches_fruit_ratio(fruit_column,fruit_search,threshold=0.3):
return (F.length(fruit_column) - F.levenshtein(
fruit_column,
F.lit(fruit_search)
)) / F.length(fruit_column)
def matches_fruit(fruit_column,fruit_search,threshold=0.6):
return F.array_max(F.transform(
F.split(fruit_column," "),
lambda fruit_piece : matches_fruit_ratio(fruit_piece,fruit_search)
)) >= threshold
This can be used as follows:
df = df.where(
matches_fruit(
F.col("fruit"),
"apple"
) | matches_fruit(
F.col("fruit"),
"orange"
)
)
df.show()
+---+----+-------------+--------+
| id|city| fruit|quantity|
+---+----+-------------+--------+
| 0| CA| apple| 300|
| 1| CA| appel| 100|
| 2| CA| orange| 20|
| 4| CA| apply berry| 3|
| 5| CA| apple berry| 1|
| 6| CA|kashmir apple| 5|
| 7| CA|kashmir appel| 8|
+---+----+-------------+--------+
For debugging purposes, I have added additional sample data and output columns for the different components of each function while demonstrating how this function may be used
df.withColumn("length",
F.length(
"fruit"
)
).withColumn("levenshtein",
F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn("length - levenshtein",
F.length(
"fruit"
) - F.levenshtein(
F.col("fruit"),
F.lit("apple")
)
).withColumn(
"matches_fruit_ratio",
matches_fruit_ratio(
F.col("fruit"),
"apple"
)
).withColumn(
"matches_fruit_values_before_threshold",
F.array_max(F.transform(
F.split("fruit"," "),
lambda fruit_piece : matches_fruit_ratio(fruit_piece,"apple")
))
).withColumn(
"matches_fruit",
matches_fruit(
F.col("fruit"),
"apple"
)
).show()
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| id|city| fruit|quantity|length|levenshtein|length - levenshtein|matches_fruit_ratio|matches_fruit_values_before_threshold|matches_fruit|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| 0| CA| apple| 300| 5| 0| 5| 1.0| 1.0| true|
| 1| CA| appel| 100| 5| 2| 3| 0.6| 0.6| true|
| 2| CA| orange| 20| 6| 5| 1|0.16666666666666666| 0.16666666666666666| false|
| 3| CA| berry| 10| 5| 5| 0| 0.0| 0.0| false|
| 4| CA| apply berry| 3| 11| 6| 5|0.45454545454545453| 0.8| true|
| 5| CA| apple berry| 1| 11| 6| 5|0.45454545454545453| 1.0| true|
| 6| CA|kashmir apple| 5| 13| 8| 5|0.38461538461538464| 1.0| true|
| 7| CA|kashmir appel| 8| 13| 10| 3|0.23076923076923078| 0.6| true|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
Upvotes: 2