Reputation: 71
I am doing some work with the PySpark MLlib FPGrowth algorithm and have an RDD in which some transactions contain duplicate items. This causes the model training function to throw an error. I'm fairly new to Spark and am wondering how to remove duplicate items within each line of an RDD. As an example:
#simple example
from pyspark.mllib.fpm import FPGrowth
data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)
model = FPGrowth.train(rdd, 0.6, 2)  # throws an error: some transactions contain duplicate items
freqit = model.freqItemsets()
freqit.collect()
So that it looks like:
#simple example
from pyspark.mllib.fpm import FPGrowth
data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data_dedup)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()
And will run without error.
Thanks in advance!
Upvotes: 0
Views: 1279
Reputation: 7732
Use it like this:
rdd = rdd.map(lambda x: list(set(x)))
This will remove the duplicate items within each transaction. Note that set() does not preserve the original item order, but FPGrowth treats each transaction as a set of items, so the order doesn't matter here.
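For completeness, here is a minimal sketch of that fix applied to your original example (assuming, as in your snippet, that sc is an existing SparkContext; sorted() is optional and just makes the deduplicated transactions deterministic):
from pyspark.mllib.fpm import FPGrowth

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)

# remove duplicate items within each transaction before training
rdd_dedup = rdd.map(lambda x: sorted(set(x)))

model = FPGrowth.train(rdd_dedup, 0.6, 2)
model.freqItemsets().collect()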
Upvotes: 1