DaBenski

Reputation: 71

Remove duplicates by line in Spark RDD

I am working with the PySpark MLlib FPGrowth algorithm and have an RDD in which some transactions contain duplicate items within a single line. These duplicates cause the model training function to throw an error. I'm fairly new to Spark and am wondering how to remove duplicates within each line of an RDD. As an example:

#simple example
from pyspark.mllib.fpm import FPGrowth

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()

I'd like to transform the data so that it looks like:

#simple example
from pyspark.mllib.fpm import FPGrowth

data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data_dedup)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()

so that training runs without error.

Thanks in advance!

Upvotes: 0

Views: 1279

Answers (1)

Thiago Baldim

Reputation: 7732

You can deduplicate the items within each line like this:

# set() keeps only one copy of each item per transaction
rdd = rdd.map(lambda x: list(set(x)))

This will remove the duplicates within each line.
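For completeness, here is a minimal sketch of the full pipeline from the question with the deduplication step applied. It assumes a SparkContext is already available as sc, as in the question's code:

from pyspark.mllib.fpm import FPGrowth

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)

# Deduplicate items within each transaction before training.
# Note: set() does not preserve the original item order, which should be
# fine here since FPGrowth treats each transaction as an unordered itemset.
rdd_dedup = rdd.map(lambda x: list(set(x)))

model = FPGrowth.train(rdd_dedup, minSupport=0.6, numPartitions=2)
print(model.freqItemsets().collect())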

Upvotes: 1
