Esmaeil zahedi

Reputation: 381

Compare documents and remove duplication in Spark and Scala

Suppose I have these documents and I want to remove the duplicate lines:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
oldest daughter teen daughter player christmas so daughter life line listen sooo hold

This is the expected output:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time 

Is there any solution for this in Scala and Spark?

Upvotes: 2

Views: 72

Answers (2)

Alister Lee

Reputation: 2455

You seem to be reading the files on a line-by-line basis, so textFile will correctly read this into an RDD of strings, one element per line. After that, distinct will reduce the RDD to the unique set of lines.

sc.textFile("yourfile.txt")
  .distinct
  .saveAsTextFile("distinct.txt")
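For completeness, a minimal self-contained sketch (assuming the RDD API in local mode; the application name and file paths here are just placeholders) could look like this:

import org.apache.spark.{SparkConf, SparkContext}

object DedupLines {
  def main(args: Array[String]): Unit = {
    // Local-mode context for illustration; point the master at your cluster as needed
    val conf = new SparkConf().setAppName("dedup-lines").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    sc.textFile("yourfile.txt")       // one String element per input line
      .distinct()                     // keep each unique line exactly once
      .saveAsTextFile("distinct.txt") // note: this creates an output directory of part files

    sc.stop()
  }
}

Keep in mind that saveAsTextFile writes a directory of part files rather than a single text file.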

Upvotes: 1

Kaushal

Reputation: 3367

You can achieve this with the reduceByKey function: map each line to a (line, 1) pair, reduce by key so each distinct line appears only once, and then keep just the keys.

You can use this code:

val textFile = sc.textFile("hdfs://...")
val uLine = textFile.map(line => (line, 1)) // pair each line with a count of 1
                    .reduceByKey(_ + _)     // one entry per distinct line
                    .map(_._1)              // keep only the line itself
uLine.saveAsTextFile("hdfs://...")

Or you can use distinct directly:

val uLine = sc.textFile("hdfs://...").distinct
uLine.saveAsTextFile("hdfs://...")
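For a quick sanity check, here is a small in-memory sketch (sample lines abbreviated from the question, assuming an existing SparkContext sc) showing that both approaches yield the same three unique lines:

// Hypothetical sample: five lines, two of them duplicates
val lines = sc.parallelize(Seq(
  "buy sansa view sell product player charger",
  "oldest daughter teen daughter player christmas",
  "thourghly sansa view delete song time",
  "oldest daughter teen daughter player christmas",
  "oldest daughter teen daughter player christmas"
))

val viaReduceByKey = lines.map(line => (line, 1)).reduceByKey(_ + _).keys // dedup via counting
val viaDistinct    = lines.distinct()                                     // dedup via distinct

println(viaReduceByKey.count()) // 3
println(viaDistinct.count())    // 3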

Upvotes: 0
