Reputation: 381
Suppose I have these documents and I want to remove duplication:
buy sansa view sell product player charger world charge player charger receive
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
This is the expected output:
buy sansa view sell product player charger world charge player charger receive
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time
Is there any solution for this in Scala and Spark?
Upvotes: 2
Views: 72
Reputation: 2455
You seem to be treating each line as a document, so textFile
will read the input into an RDD of strings, one element per line. After that, distinct
will reduce the RDD to the set of unique lines.
// Read the file line by line, keep only the unique lines, and write them back out.
// Note: saveAsTextFile creates a directory named "distinct.txt" containing part files.
sc.textFile("yourfile.txt")
  .distinct
  .saveAsTextFile("distinct.txt")
Upvotes: 1
Reputation: 3367
You can achieve this with the reduceByKey function.
You can use this code:
// Here, spark is a SparkContext (e.g. sc in the Spark shell).
val textFile = spark.textFile("hdfs://...")
// Pair each line with a count of 1, sum the counts per line,
// then keep only the line itself, so every distinct line appears once.
val uLine = textFile.map(line => (line, 1))
  .reduceByKey(_ + _)
  .map(uLine => uLine._1)
uLine.saveAsTextFile("hdfs://...")
Or you can simply use:
val uLine = spark.textFile("hdfs://...").distinct
uLine.saveAsTextFile("hdfs://...")
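If you are on Spark 2.x with a SparkSession (conventionally also named spark), the Dataset API gives you the same deduplication; a minimal sketch, with the HDFS paths left as placeholders:
// spark here is a SparkSession, not a SparkContext.
val lines = spark.read.textFile("hdfs://...") // Dataset[String], one element per line
lines.distinct().write.text("hdfs://...")     // keep the unique lines and write them out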
Upvotes: 0