Reputation: 2653
I have a PySpark dataframe with tens of thousands of rows. It has two columns: (1) a set of items that occur together, stored as a string, and (2) the frequency of that occurrence. I want to count how many times each item occurs together with each of the other items. Sample data is below.
import pandas as pd

pd.DataFrame({'items': ['[a,b,c]', '[c,d,e]', '[a,e]', '[a,b,e]', '[b,c]', '[c,d,e,b]'],
              'freq': [10, 16, 7, 2, 6, 5]})
The expected output is something like below.
[(a, b), 12], [(a, c), 10], [(c, d), 21], etc
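In plain Python the counting I want would look like the sketch below (using itertools.combinations to enumerate each unordered pair once), but I need to do this in PySpark:

from collections import Counter
from itertools import combinations

data = {'items': ['[a,b,c]', '[c,d,e]', '[a,e]', '[a,b,e]', '[b,c]', '[c,d,e,b]'],
        'freq': [10, 16, 7, 2, 6, 5]}

counts = Counter()
for items, freq in zip(data['items'], data['freq']):
    members = items.strip('[]').split(',')         # '[a,b,c]' -> ['a', 'b', 'c']
    for pair in combinations(sorted(members), 2):  # each unordered pair once
        counts[pair] += freq

print(counts[('a', 'b')])  # 12
print(counts[('c', 'd')])  # 21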
Upvotes: 0
Views: 230
Reputation: 14905
Using the approach from this answer, all possible combinations of two items can be calculated. The result is then exploded, grouped, and summed:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame(
    {'items': ['[a,b,c]', '[c,d,e]', '[a,e]', '[a,b,e]', '[b,c]', '[c,d,e,b]'],
     'freq': [10, 16, 7, 2, 6, 5]}))

# Strip the surrounding brackets and split the string into an array of items.
df.withColumn("items_array", F.expr('split(regexp_replace(items, "[\\\\[\\\\]]", ""), ",")')) \
    .withColumn(
        "out",
        # Cross the array with itself (transform + arrays_zip), flatten into
        # one array of pairs, then keep each unordered pair exactly once
        # (and drop self-pairs) with the x[0] < x[1] filter.
        F.expr("""
            filter(
                transform(
                    flatten(transform(
                        items_array,
                        x -> arrays_zip(array_repeat(x, size(items_array)), items_array)
                    )),
                    x -> array(x["0"], x["items_array"])
                ),
                x -> x[0] < x[1]
            )
        """)
    ) \
    .withColumn("out", F.explode("out")) \
    .groupBy("out") \
    .agg(F.sum("freq")) \
    .orderBy("out") \
    .show()
Output:
+------+---------+
| out|sum(freq)|
+------+---------+
|[a, b]| 12|
|[a, c]| 10|
|[a, e]| 9|
|[b, c]| 21|
|[b, d]| 5|
|[b, e]| 7|
|[c, d]| 21|
|[c, e]| 21|
|[d, e]| 21|
+------+---------+
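For comparison, the same counts can also be produced without higher-order functions, by exploding the array and self-joining on a row id. This is only a sketch (the variable names exploded, item_1, item_2 are arbitrary), reusing the df defined above:

# Alternative sketch: explode the items, then self-join per row and keep each
# unordered pair once via the l.item < r.item comparison.
exploded = df \
    .withColumn("id", F.monotonically_increasing_id()) \
    .withColumn("item", F.explode(
        F.expr('split(regexp_replace(items, "[\\\\[\\\\]]", ""), ",")')))

exploded.alias("l").join(exploded.alias("r"), "id") \
    .where(F.col("l.item") < F.col("r.item")) \
    .groupBy(F.col("l.item").alias("item_1"), F.col("r.item").alias("item_2")) \
    .agg(F.sum(F.col("l.freq")).alias("freq")) \
    .orderBy("item_1", "item_2") \
    .show()

Both produce the same counts on the sample data; the higher-order-function version avoids the extra join.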
Upvotes: 1