Reputation: 2653
I have a PySpark dataframe with tens of thousands of rows. It has two columns: (1) a set of items that occur together, stored as a string, and (2) the frequency of that occurrence. I want to count how many times each item occurs together with each of the other items. Sample data is below.
import pandas as pd

pd.DataFrame({'items': ['[a,b,c]', '[c,d,e]', '[a,e]', '[a,b,e]', '[b,c]', '[c,d,e,b]'],
              'freq': [10, 16, 7, 2, 6, 5]})
The expected output is something like below.
[(a, b), 12], [(a, c), 10], [(c, d), 21], etc
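In plain Python the counting I want would look like the sketch below (using itertools.combinations to enumerate each unordered pair once), but I need to do this in PySpark:

from collections import Counter
from itertools import combinations

data = {'items': ['[a,b,c]', '[c,d,e]', '[a,e]', '[a,b,e]', '[b,c]', '[c,d,e,b]'],
        'freq': [10, 16, 7, 2, 6, 5]}

counts = Counter()
for items, freq in zip(data['items'], data['freq']):
    members = items.strip('[]').split(',')         # '[a,b,c]' -> ['a', 'b', 'c']
    for pair in combinations(sorted(members), 2):  # each unordered pair once
        counts[pair] += freq

print(counts[('a', 'b')])  # 12
print(counts[('c', 'd')])  # 21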
Upvotes: 0
Views: 230
Reputation: 14905
Using the approach from this answer, all possible combinations of two items can be calculated. The result is then exploded, grouped, and summed:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame(
    {'items': ['[a,b,c]', '[c,d,e]', '[a,e]', '[a,b,e]', '[b,c]', '[c,d,e,b]'],
     'freq': [10, 16, 7, 2, 6, 5]}))

# Strip the surrounding brackets and split the string into an array of items.
df.withColumn("items_array", F.expr('split(regexp_replace(items, "[\\\\[\\\\]]", ""), ",")')) \
    .withColumn(
        "out",
        # Cross the array with itself (transform + arrays_zip), flatten into
        # one array of pairs, then keep each unordered pair exactly once
        # (and drop self-pairs) with the x[0] < x[1] filter.
        F.expr("""
            filter(
                transform(
                    flatten(transform(
                        items_array,
                        x -> arrays_zip(array_repeat(x, size(items_array)), items_array)
                    )),
                    x -> array(x["0"], x["items_array"])
                ),
                x -> x[0] < x[1]
            )
        """)
    ) \
    .withColumn("out", F.explode("out")) \
    .groupBy("out") \
    .agg(F.sum("freq")) \
    .orderBy("out") \
    .show()
Output:
+------+---------+
| out|sum(freq)|
+------+---------+
|[a, b]| 12|
|[a, c]| 10|
|[a, e]| 9|
|[b, c]| 21|
|[b, d]| 5|
|[b, e]| 7|
|[c, d]| 21|
|[c, e]| 21|
|[d, e]| 21|
+------+---------+
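For comparison, the same counts can also be produced without higher-order functions, by exploding the array and self-joining on a row id. This is only a sketch (the variable names exploded, item_1, item_2 are arbitrary), reusing the df defined above:

# Alternative sketch: explode the items, then self-join per row and keep each
# unordered pair once via the l.item < r.item comparison.
exploded = df \
    .withColumn("id", F.monotonically_increasing_id()) \
    .withColumn("item", F.explode(
        F.expr('split(regexp_replace(items, "[\\\\[\\\\]]", ""), ",")')))

exploded.alias("l").join(exploded.alias("r"), "id") \
    .where(F.col("l.item") < F.col("r.item")) \
    .groupBy(F.col("l.item").alias("item_1"), F.col("r.item").alias("item_2")) \
    .agg(F.sum(F.col("l.freq")).alias("freq")) \
    .orderBy("item_1", "item_2") \
    .show()

Both produce the same counts on the sample data; the higher-order-function version avoids the extra join.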
Upvotes: 1