Esmaeil zahedi

Reputation: 381

Compare documents and remove duplication in Spark and Scala

Suppose I have these documents and I want to remove the duplicate lines:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
oldest daughter teen daughter player christmas so daughter life line listen sooo hold

This is the expected output:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time 

Is there any solution for this in Scala and Spark?

Upvotes: 2

Views: 72

Answers (2)

Alister Lee

Reputation: 2455

You seem to be reading the files on a line-by-line basis, so textFile will correctly read this into an RDD of strings, one element per line. After that, distinct will reduce the RDD to the unique set of lines.

sc.textFile("yourfile.txt")
  .distinct
  .saveAsTextFile("distinct.txt")
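For completeness, a minimal self-contained sketch (assuming the RDD API in local mode; the application name and file paths here are just placeholders) could look like this:

import org.apache.spark.{SparkConf, SparkContext}

object DedupLines {
  def main(args: Array[String]): Unit = {
    // Local-mode context for illustration; point the master at your cluster as needed
    val conf = new SparkConf().setAppName("dedup-lines").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    sc.textFile("yourfile.txt")       // one String element per input line
      .distinct()                     // keep each unique line exactly once
      .saveAsTextFile("distinct.txt") // note: this creates an output directory of part files

    sc.stop()
  }
}

Keep in mind that saveAsTextFile writes a directory of part files rather than a single text file.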

Upvotes: 1

Kaushal

Reputation: 3367

You can achieve this with the reduceByKey function: map each line to a (line, 1) pair, reduce by key so each distinct line appears only once, and then keep just the keys.

You can use this code:

val textFile = sc.textFile("hdfs://...")
val uLine = textFile.map(line => (line, 1)) // pair each line with a count of 1
                    .reduceByKey(_ + _)     // one entry per distinct line
                    .map(_._1)              // keep only the line itself
uLine.saveAsTextFile("hdfs://...")

Or you can use distinct directly:

val uLine = sc.textFile("hdfs://...").distinct
uLine.saveAsTextFile("hdfs://...")
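For a quick sanity check, here is a small in-memory sketch (sample lines abbreviated from the question, assuming an existing SparkContext sc) showing that both approaches yield the same three unique lines:

// Hypothetical sample: five lines, two of them duplicates
val lines = sc.parallelize(Seq(
  "buy sansa view sell product player charger",
  "oldest daughter teen daughter player christmas",
  "thourghly sansa view delete song time",
  "oldest daughter teen daughter player christmas",
  "oldest daughter teen daughter player christmas"
))

val viaReduceByKey = lines.map(line => (line, 1)).reduceByKey(_ + _).keys // dedup via counting
val viaDistinct    = lines.distinct()                                     // dedup via distinct

println(viaReduceByKey.count()) // 3
println(viaDistinct.count())    // 3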

Upvotes: 0
