user622194

Reputation:

Get distinct items from rows of comma separated strings in Spark 2.0

I am using Spark 2.0 to analyze a data set. One column contains string data like this:

A,C
A,B
A
B
B,C

I want to get a JavaRDD with all distinct items that appear in the column, something like this:

A
B
C 

How can this be done efficiently in Spark? I am using Spark with Java, but Scala examples or pointers would also be useful.

Edit: I have tried using flatMap, but my implementation is very slow.

JavaRDD<String> d = dataset.flatMap(s -> Arrays.asList(s.split(",")).iterator());

Upvotes: 2

Views: 2406

Answers (3)

KiranM

Reputation: 1323

I know it's lengthy, but please try this if you want the execution distributed across the executor nodes rather than done on a single node.

dataset
  .flatMap(x => x.split(","))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .sortByKey()
  .map(x => x._1)

Add this if you get an error with the data format:

dataset.map(x => (x._1 + "," + x._2 + "," + x._3))

Curious to know your findings.
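Outside Spark, the same flatMap → pair → reduceByKey → keys pipeline can be mimicked with plain Java streams to see what it computes (an illustrative sketch only; `rows` stands in for the column values from the question, and the class name is made up):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ReduceByKeySketch {
    // Mimics flatMap -> map(x, 1) -> reduceByKey(_ + _) -> keys on local data
    public static List<String> distinctByCounting(List<String> rows) {
        Map<String, Long> counts = rows.stream()
                .flatMap(s -> Arrays.stream(s.split(",")))   // flatMap: one token per element
                .collect(Collectors.groupingBy(              // map to (token, 1) + reduceByKey
                        Function.identity(),
                        TreeMap::new,                        // sorted keys, like sortByKey
                        Collectors.counting()));
        return List.copyOf(counts.keySet());                 // keep only the keys
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("A,C", "A,B", "A", "B", "B,C");
        System.out.println(distinctByCounting(rows)); // prints [A, B, C]
    }
}
```

Note that the counting step is only a means to an end here: the counts themselves are discarded, which is why a plain distinct (as in the other answers) is usually the simpler choice.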

Upvotes: -1

Glennie Helles Sindholt

Reputation: 13154

I'm not sure what you mean by "slow". Presumably you have a very large dataset (otherwise you wouldn't need Spark) so "slow" is relative. However, I would simply do

dataset.flatMap(s -> s.split(",")).distinct

Upvotes: 2

Yaron

Reputation: 10450

try using:

1) explode (https://spark.apache.org/docs/2.0.0/api/java/): org.apache.spark.sql.functions.explode(Column col)

static Column   explode(Column e)

Explode - Creates a new row for each element in the given array or map column.

2) Then execute "distinct" on this column:

http://spark.apache.org/docs/latest/programming-guide.html

distinct([numTasks]) - Return a new dataset that contains the distinct elements of the source dataset.

Summary

Explode will result in one item per row (in this specific column).

Distinct will leave only the distinct items.
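Put together, and assuming the strings live in a DataFrame column named "items" (a hypothetical name; df is a stand-in for the asker's Dataset<Row>), the two steps look roughly like this in the Java DataFrame API. This is an untested sketch, not a drop-in solution:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// split turns "A,C" into the array ["A", "C"], explode gives one row per element,
// distinct then keeps only the unique items
Dataset<Row> distinctItems = df
        .select(explode(split(col("items"), ",")).as("item"))
        .distinct();
distinctItems.show();
```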

Upvotes: 0
