Rajesh Meher

Reputation: 57

Use SequenceMatcher on an array column in PySpark

I have a PySpark DataFrame with one array column, 'test', and three or more rows.

test
-----------------------------------
['hello', 'hell', 'Help', 'helper']
['matter', 'matt', 'mater']
['sequence', 'seque']

How can I use difflib.SequenceMatcher to iterate over the elements of each row so that, if the ratio of two elements is less than 90%, both elements are appended to a new column, say 'test_ratio', and if it is greater, only one of the two is kept?

Example: in the first row, compare the first two elements, 'hello' and 'hell'; if the ratio is greater than 90%, append only 'hello' to test_ratio. Then compare 'hello' with 'Help'; if the ratio is less than 90%, append 'Help' as well. Then compare 'hello' with 'helper'; since the two are different, append 'helper', and so on. Basically, I want to keep the distinct elements of each array, i.e. those whose similarity index is below 90%.

Upvotes: 0

Views: 959

Answers (1)

werner

Reputation: 14905

You can create a UDF that filters the arrays. The logic inside the UDF compares the first element of each array with every other element of the same array and keeps only those strings for which ratio() returns a value less than or equal to 0.9. The first element itself is always kept.
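For intuition about the 0.9 threshold, here is a small plain-Python sketch (no Spark needed) that computes the ratio of the first element against each later element, using the sample rows from the question. The helper name ratios_to_first is hypothetical and only serves this illustration.

```python
from difflib import SequenceMatcher


def ratios_to_first(words):
    """Return (word, ratio) pairs comparing words[0] with each later word."""
    first = words[0]
    return [
        (w, round(SequenceMatcher(None, first, w).ratio(), 3))
        for w in words[1:]
    ]


# Sample rows taken from the question.
rows = [
    ["hello", "hell", "Help", "helper"],
    ["matter", "matt", "mater"],
    ["sequence", "seque"],
]

for row in rows:
    print(row[0], ratios_to_first(row))
```

With these inputs, only 'mater' (ratio of about 0.909 against 'matter') exceeds 0.9, so it is the only element a first-element filter would drop. Note that 'hello' vs 'hell' scores about 0.889, just under the threshold.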

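from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Reuse the active session (e.g. in a notebook or shell) or create one.
spark = SparkSession.builder.getOrCreate()

data = [[['hello','hell','Help','helper']],[['matter','matt','mater']],[['sequence','seque']]]
df = spark.createDataFrame(data, schema=["test"])

@F.udf(returnType=T.ArrayType(T.StringType()))
def almost_distinct_values(inp):
    # Imported inside the function so the import also runs on the executors.
    from difflib import SequenceMatcher
    # Null or empty arrays have no first element to compare against.
    if not inp:
        return inp
    # Keep the first element plus every later element whose ratio to it is at most 0.9.
    return [inp[0]] + [v for v in inp[1:] if SequenceMatcher(None, inp[0], v).ratio() <= 0.9]

df.withColumn("test_ratio", almost_distinct_values("test")).show(truncate=False)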

Output:

+---------------------------+---------------------------+
|test                       |test_ratio                 |
+---------------------------+---------------------------+
|[hello, hell, Help, helper]|[hello, hell, Help, helper]|
|[matter, matt, mater]      |[matter, matt]             |
|[sequence, seque]          |[sequence, seque]          |
+---------------------------+---------------------------+
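
The UDF above only compares each element with the first one. If the goal is for every kept element to differ from all previously kept elements (which is one reading of the question), a variant could look like the sketch below. It is plain Python so it can be tried without Spark; the function name keep_dissimilar and its threshold parameter are assumptions made for this illustration, not part of the original answer.

```python
from difflib import SequenceMatcher


def keep_dissimilar(words, threshold=0.9):
    """Keep a word only if its ratio to every already-kept word is <= threshold."""
    kept = []
    # Treat a null (None) array like an empty one.
    for word in words or []:
        if all(
            SequenceMatcher(None, k, word).ratio() <= threshold
            for k in kept
        ):
            kept.append(word)
    return kept


print(keep_dissimilar(["matter", "matt", "mater"]))
print(keep_dissimilar(["hello", "helper", "helpers"]))
```

In the second call, 'helpers' is dropped because it is too close to 'helper', even though it is quite different from 'hello'; a first-element-only filter would keep it. To use this in PySpark, the function can presumably be wrapped the same way as above, e.g. F.udf(keep_dissimilar, T.ArrayType(T.StringType())). Be aware that the comparison count grows quadratically with the array length.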

Upvotes: 1
