Reputation: 3728
I am trying to implement the FP-growth algorithm. I have data in the following format:
Food rank
apple 1
caterpillar 1
banana 2
monkey 2
dog 3
bone 3
oath 3
How do I transform it into [[apple,caterpillar],[banana,monkey],[dog,bone,oath]]?
Upvotes: 0
Views: 321
Reputation: 24198
Assuming your data is a DataFrame, we first convert it to an rdd, then define the keys, use them to group the data, and finally map the values into a list and extract them. We can do this in two ways. Either use groupByKey():
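(df.rdd
   .map(lambda x: (x[1], x[0]))  # key each row by rank: (rank, Food)
   .groupByKey()                 # gather all foods that share a rank
   .mapValues(list)              # turn each grouped iterable into a list
   .values())                    # drop the keys, keep only the lists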
Or use reduceByKey(), which is typically more efficient because it combines values within each partition before shuffling:
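(df.rdd
   .map(lambda x: (x[1], [x[0]]))    # key by rank, wrap each food in a one-item list
   .reduceByKey(lambda x, y: x + y)  # concatenate the lists for each rank
   .values())                        # drop the keys, keep only the lists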
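To see what both pipelines compute without starting Spark, here is a minimal plain-Python sketch of the same key, group, and collect steps. The names rows, group_by_rank and transactions are made up for illustration and are not part of any Spark API.

```python
from collections import defaultdict

# Same (Food, rank) rows as the example data, held in a local list
# instead of a DataFrame (an assumption made for this sketch).
rows = [("apple", 1), ("caterpillar", 1), ("banana", 2), ("monkey", 2),
        ("dog", 3), ("bone", 3), ("oath", 3)]


def group_by_rank(rows):
    """Local stand-in for map -> group by key -> collect values into lists."""
    groups = defaultdict(list)
    for food, rank in rows:
        # rank plays the role of the key (x[1]), food the value (x[0])
        groups[rank].append(food)
    # Like .values(): discard the keys and keep one list per rank
    return list(groups.values())


transactions = group_by_rank(rows)
print(transactions)
# [['apple', 'caterpillar'], ['banana', 'monkey'], ['dog', 'bone', 'oath']]
```

The local version keeps insertion order, whereas on a real RDD the order of the groups, and of the items inside each group, is not guaranteed after a shuffle.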
Data:
df = sc.parallelize([("apple", 1),
                     ("caterpillar", 1),
                     ("banana", 2),
                     ("monkey", 2),
                     ("dog", 3),
                     ("bone", 3),
                     ("oath", 3)]).toDF(["Food", "rank"])
Upvotes: 1