AlexV

Reputation: 3896

How to find distinct values of multiple columns in Spark

I have an RDD and I want to find distinct values for multiple columns.

Example:

Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)

I would like to have a map:

col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]

Can DataFrames help compute this faster or more simply?

Update:

My solution with RDD was:


def to_uniq_vals(row):
    # Emit (column, value) pairs for each Row; a Row exposes its fields via asDict()
    return [(k, v) for k, v in row.asDict().items()]

rdd.flatMap(to_uniq_vals).distinct().collect()
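
This returns a flat list of (column, value) pairs rather than the per-column map shown above; a minimal sketch of grouping the pairs into that map on the driver (assuming the RDD holds Row objects as in the example):

from collections import defaultdict

# Group the distinct (column, value) pairs into {"col1": [...], "col2": [...], ...}
pairs = rdd.flatMap(to_uniq_vals).distinct().collect()
uniq_vals = defaultdict(list)
for col, val in pairs:
    uniq_vals[col].append(val)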

Thanks

Upvotes: 2

Views: 14445

Answers (2)

Aman Sehgal

Reputation: 556

You can use dropDuplicates and then select the same columns. It might not be the most efficient way, but it is still a decent one:

df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON

Note: works well in Scala.
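
For PySpark users, a rough equivalent sketch of the same one-liner, using the example columns from the question (col1..col3 and the DataFrame name df are assumptions, not a general N-column solution):

# Distinct row combinations over the chosen columns, returned as JSON strings
distinct_rows = (df.dropDuplicates(["col1", "col2", "col3"])
                   .select("col1", "col2", "col3")
                   .toJSON()
                   .collect())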

Upvotes: 4

Elior Malul

Reputation: 691

I hope I understand your question correctly; you can try the following:

import org.apache.spark.sql.{functions => F}
import spark.implicits._  // brings toDF into scope (already imported in spark-shell)

val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show

Results:

+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
|     [a1, b, a]|      [1, 2, 4]|        [1, 10]|
+---------------+---------------+---------------+

The code above should be more efficient than the proposed column-by-column select distinct, for several reasons:

  1. Fewer round trips between the workers and the driver.
  2. De-duplication is done locally on each worker before the cross-worker de-duplication.
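
As a side note, the same collect_set idea can be written once over all columns instead of listing them by hand; a minimal PySpark sketch, assuming a DataFrame named df:

from pyspark.sql import functions as F

# One collect_set per column, computed in a single pass over the DataFrame
distinct_per_col = df.select([F.collect_set(c).alias(c) for c in df.columns])
distinct_per_col.show()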

Hope it helps!

Upvotes: 6
