Making data ready for FP growth in pyspark

Question

I am trying to implement the FP-growth algorithm. I have data in the following format:

Food        rank
apple       1
caterpillar 1
banana      2
monkey      2
dog         3
bone        3
oath        3

How do I transform it into [[apple,caterpillar],[banana,monkey],[dog,bone,oath]]?

mtoto · Accepted Answer

Assuming your data is a DataFrame, first convert it to an RDD and map each row to a (rank, Food) key-value pair. Then group the pairs by key, turn each group's values into a list, and keep only the values. There are two ways to do this. One is groupByKey():

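(df.rdd
 .map(lambda x: (x[1], x[0]))   # (rank, Food): rank becomes the key
 .groupByKey()                  # collect all Food values per rank
 .mapValues(list)               # turn each grouped iterable into a list
 .values())                     # drop the keys, keep the lists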

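Or use reduceByKey(), which merges values within each partition before the shuffle and is therefore usually the more efficient choice for aggregations (when the values are only concatenated into lists, the gain may be small):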

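(df.rdd
 .map(lambda x: (x[1], [x[0]]))     # (rank, [Food]): wrap each value in a list
 .reduceByKey(lambda x, y: x + y)   # concatenate the lists that share a rank
 .values())                         # drop the keys, keep the lists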

Data:

df = sc.parallelize([("apple", 1),
                     ("caterpillar", 1),
                     ("banana", 2),
                     ("monkey", 2),
                     ("dog", 3),
                     ("bone", 3),
                     ("oath", 3)]).toDF(["Food", "rank"])
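To make the intermediate key-value pairs easier to follow, here is a plain-Python sketch of what the two pipelines above compute. It does not use Spark at all. The helper names group_by_key and reduce_by_key are made up for illustration and only mimic the RDD methods on a local list.

```python
from collections import defaultdict

# Same rows as the DataFrame above: (Food, rank)
rows = [("apple", 1), ("caterpillar", 1), ("banana", 2), ("monkey", 2),
        ("dog", 3), ("bone", 3), ("oath", 3)]


def group_by_key(rows):
    """Mimic map -> groupByKey -> mapValues(list) -> values on a local list."""
    groups = defaultdict(list)
    for food, rank in rows:
        groups[rank].append(food)  # key = rank, value = food
    return list(groups.values())


def reduce_by_key(rows):
    """Mimic map -> reduceByKey(lambda x, y: x + y) -> values on a local list."""
    merged = {}
    for food, rank in rows:
        value = [food]  # each value starts as a one-element list
        merged[rank] = merged[rank] + value if rank in merged else value
    return list(merged.values())


transactions = group_by_key(rows)
print(transactions)
# [['apple', 'caterpillar'], ['banana', 'monkey'], ['dog', 'bone', 'oath']]
```

Both helpers return the same nested list here because a local dict preserves insertion order. On a real cluster, Spark does not guarantee the order of the groups or of the items within a group, which should not matter for FP-growth since each transaction is treated as a set of items.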
