DaBenski

Reputation: 71

Remove duplicates by line in Spark RDD

I am working with the PySpark MLlib FPGrowth algorithm and have an RDD in which some transactions contain duplicate items within a single line. These duplicates cause the model training function to throw an error. I'm fairly new to Spark and am wondering how to remove duplicates within each line of an RDD. As an example:

#simple example
from pyspark.mllib.fpm import FPGrowth

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()

I'd like to transform the data so that it looks like:

#simple example
from pyspark.mllib.fpm import FPGrowth

data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data_dedup)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()

so that training runs without error.

Thanks in advance!

Upvotes: 0

Views: 1279

Answers (1)

Thiago Baldim

Reputation: 7732

You can deduplicate the items within each line like this:

# set() keeps only one copy of each item per transaction
rdd = rdd.map(lambda x: list(set(x)))

This will remove the duplicates within each line.
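For completeness, here is a minimal sketch of the full pipeline from the question with the deduplication step applied. It assumes a SparkContext is already available as sc, as in the question's code:

from pyspark.mllib.fpm import FPGrowth

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)

# Deduplicate items within each transaction before training.
# Note: set() does not preserve the original item order, which should be
# fine here since FPGrowth treats each transaction as an unordered itemset.
rdd_dedup = rdd.map(lambda x: list(set(x)))

model = FPGrowth.train(rdd_dedup, minSupport=0.6, numPartitions=2)
print(model.freqItemsets().collect())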

Upvotes: 1
