Apply UDF to subsets of pyspark dataframe

Question

I have a DataFrame like the following, containing two sorted lists of strings (value1 and value2) for each possible combination of key1 and key2.

df=
+----+------------+-------+-------+
|key1|        key2| value1| value2|
+----+------------+-------+-------+
| 'a'|  '10,0,10' |  'abc'|  'abc'|
| 'a'|  '10,0,10' |  'aab'|  'aab'|
| 'a'|  '10,0,10' |  'acb'|  'acb'|
| 'a'|  '10,0,20' |  'abc'|  'abc'|
| 'a'|  '10,0,20' |  'acb'|  'aab'|
| 'a'|  '10,0,20' |  'aab'|  'acb'|
| 'b'|  '10,0,10' |  'bcd'|  'bcd'|
| 'b'|  '10,0,10' |  'bbc'|  'bdc'|
| 'b'|  '10,0,10' |  'bdc'|  'bbc'|
|...

Now I want to apply a function like this:

for c in [x['key1'] for x in df.select('key1').distinct().collect()]:
    for s in [x['key2'] for x in df.select('key2').distinct().collect()]:
        # Rows belonging to this (key1, key2) combination
        subset = df.filter((df['key1'] == c) & (df['key2'] == s))
        jaccard_sim([x['value1'] for x in subset.select('value1').collect()],
                    [x['value2'] for x in subset.select('value2').collect()])

But since I want to use Spark's ability to parallelize the execution, I think the above implementation might be kind of stupid ;) Does anyone have an idea how to solve this?

The background is that I have a sorted list (value1) per key1 and key2 combination, which I want to compare to a benchmark list per key1 (value2) by calculating the Jaccard similarity between the lists. If anyone has a (better) general suggestion on how to do this with pyspark, I would really appreciate it! Thanks :)

mayank agrawal · Accepted Answer

You can approach it like this:

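import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

def convert_form(x):
    # x is the collected list of (value1, value2) structs for one (key1, key2) group
    val1 = [y['value1'] for y in x]
    val2 = [y['value2'] for y in x]
    return [val1, val2]

# Assuming you have a jaccard_sim function that returns a number;
# without an explicit return type the UDF would produce a string column.
jaccard_udf = F.udf(lambda x: float(jaccard_sim(*convert_form(x))), DoubleType())

# Pack both value columns into a struct, collect the structs per group,
# then apply the UDF once per (key1, key2) combination.
df = df.select('key1', 'key2', F.struct('value1','value2').alias('values'))\
       .groupby('key1', 'key2').agg(F.collect_list('values').alias('collected_col'))\
       .withColumn('jaccard_distance', jaccard_udf(F.col('collected_col')) )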

df.show()
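
The code above assumes a jaccard_sim function already exists, and the question does not show it. A minimal sketch, assuming the two lists should be compared as sets (the usual Jaccard definition, so order and duplicates are ignored) and that two empty lists count as identical. The sample rows are hypothetical plain dicts standing in for the collected structs, so the snippet runs without Spark:

```python
def jaccard_sim(list1, list2):
    """Jaccard similarity |A & B| / |A | B|, treating both lists as sets."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumption: two empty lists are treated as identical.
        return 1.0
    return len(set1 & set2) / len(union)


def convert_form(rows):
    # Same reshaping as in the answer: split the structs into two lists.
    val1 = [r['value1'] for r in rows]
    val2 = [r['value2'] for r in rows]
    return [val1, val2]


# Hypothetical group: what the UDF would receive for one (key1, key2) pair.
rows = [
    {'value1': 'abc', 'value2': 'abc'},
    {'value1': 'aab', 'value2': 'aab'},
    {'value1': 'acb', 'value2': 'xyz'},
]

# 2 shared strings out of 4 distinct strings overall
similarity = jaccard_sim(*convert_form(rows))
print(similarity)  # 0.5
```

Because this version is set-based, groups whose value1 and value2 hold the same strings in a different order (as in the sample table) all score 1.0. If the ordering of the lists matters, a different measure would be needed.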
