I am using Spark 2.0 to analyze a data set. One column contains string data like this:
A,C
A,B
A
B
B,C
I want to get a JavaRDD with all distinct items that appear in the column, something like this:
A
B
C
How can this be done efficiently in Spark? I am using Spark with Java, but Scala examples or pointers would be useful.
Edit: I have tried using flatMap, but my implementation is very slow.
JavaRDD<String> d = dataset.flatMap(s -> Arrays.asList(s.split(",")).iterator());
Upvotes: 2
Views: 2406
Reputation: 1323
I know it's lengthy, but please try this if you want the execution done entirely on the executors rather than on the driver.
dataset
  .flatMap(x => x.split(","))
  .map(x => (x, 1))
  .sortByKey()
  .reduceByKey(_ + _)
  .map(x => x._1)
Add this first if you get an error with the data format:
dataset.map(x => (x._1 + "," + x._2 + "," + x._3))
Curious to know your findings.
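Outside Spark, the same reduce-by-key idea can be sketched in plain Java (a local simulation, not Spark code; `rows` is a stand-in for the column values):

```java
import java.util.*;
import java.util.stream.*;

public class ReduceByKeyDistinct {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("A,C", "A,B", "A", "B", "B,C");

        // flatMap(x => x.split(",")), map(x => (x, 1)), reduceByKey(_ + _):
        // split every row and sum a count of 1 per occurrence of each item
        Map<String, Integer> counts = rows.stream()
                .flatMap(s -> Arrays.stream(s.split(",")))
                .collect(Collectors.toMap(x -> x, x -> 1, Integer::sum));

        // map(x => x._1): keep only the keys, i.e. the distinct items
        List<String> distinctItems = new ArrayList<>(counts.keySet());
        Collections.sort(distinctItems);
        System.out.println(distinctItems); // [A, B, C]
    }
}
```

Note that the counts themselves are thrown away at the end; the pair-and-reduce step only serves to collapse duplicates, which is why the simpler `distinct` approach in the other answer does the same job with less work.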
Upvotes: -1
Reputation: 13154
I'm not sure what you mean by "slow". Presumably you have a very large dataset (otherwise you wouldn't need Spark) so "slow" is relative. However, I would simply do
dataset.flatMap(s -> Arrays.asList(s.split(",")).iterator()).distinct()
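The flatMap-plus-distinct shape is easy to check locally with plain Java streams (a sketch with no Spark dependency; `rows` stands in for the RDD):

```java
import java.util.*;
import java.util.stream.*;

public class FlatMapDistinct {
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("A,C", "A,B", "A", "B", "B,C");

        // Equivalent of dataset.flatMap(...split...).distinct():
        // split each row into items, flatten, and drop duplicates
        List<String> distinct = rows.stream()
                .flatMap(s -> Arrays.stream(s.split(",")))
                .distinct()
                .sorted()
                .collect(Collectors.toList());

        System.out.println(distinct); // [A, B, C]
    }
}
```

In Spark this is two narrow-plus-one-shuffle stages at most, so if it is "slow", the bottleneck is more likely data size or partitioning than the transformation itself.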
Upvotes: 2
Reputation: 10450
Try using:
1) explode (org.apache.spark.sql.functions.explode(Column col)): https://spark.apache.org/docs/2.0.0/api/java/
static Column explode(Column e)
explode creates a new row for each element in the given array or map column.
2) Then execute "distinct" on this column:
http://spark.apache.org/docs/latest/programming-guide.html
distinct([numTasks]) - Return a new dataset that contains the distinct elements of the source dataset.
Summary
Explode will produce one row per item (for this specific column).
Distinct will keep only the distinct items.
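Conceptually, explode turns each array cell into its own row, and distinct then de-duplicates. A minimal plain-Java simulation of those two steps (not actual Spark DataFrame code; the input mimics what `split` on the column would produce):

```java
import java.util.*;
import java.util.stream.*;

public class ExplodeDistinct {
    public static void main(String[] args) {
        // Each row holds an array column, as split(",") would produce
        List<String[]> rows = Arrays.asList(
                new String[]{"A", "C"},
                new String[]{"A", "B"},
                new String[]{"A"},
                new String[]{"B"},
                new String[]{"B", "C"});

        // "explode": emit one output row per array element
        List<String> exploded = rows.stream()
                .flatMap(Arrays::stream)
                .collect(Collectors.toList());

        // "distinct": keep each element once
        List<String> result = exploded.stream()
                .distinct()
                .sorted()
                .collect(Collectors.toList());

        System.out.println(result); // [A, B, C]
    }
}
```

The DataFrame route does the same flatten-then-deduplicate work as the RDD `flatMap` answers, but lets the Catalyst optimizer plan it.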
Upvotes: 0