AlexV

Reputation: 3896

How to find distinct values of multiple columns in Spark

I have an RDD and I want to find distinct values for multiple columns.

Example:

Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)

I would like to have a map:

col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]

Can DataFrames help compute this faster or more simply?

Update:

My solution with RDD was:


def to_uniq_vals(row):
    # Emit (column, value) pairs for each Row; a Row exposes its fields via asDict()
    return [(k, v) for k, v in row.asDict().items()]

rdd.flatMap(to_uniq_vals).distinct().collect()
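
This returns a flat list of (column, value) pairs rather than the per-column map shown above; a minimal sketch of grouping the pairs into that map on the driver (assuming the RDD holds Row objects as in the example):

from collections import defaultdict

# Group the distinct (column, value) pairs into {"col1": [...], "col2": [...], ...}
pairs = rdd.flatMap(to_uniq_vals).distinct().collect()
uniq_vals = defaultdict(list)
for col, val in pairs:
    uniq_vals[col].append(val)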

Thanks

Upvotes: 2

Views: 14445

Answers (2)

Aman Sehgal

Reputation: 556

You can use dropDuplicates and then select the same columns. It might not be the most efficient way, but it is still a decent one:

df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON

Note: works well in Scala.
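
For PySpark users, a rough equivalent sketch of the same one-liner, using the example columns from the question (col1..col3 and the DataFrame name df are assumptions, not a general N-column solution):

# Distinct row combinations over the chosen columns, returned as JSON strings
distinct_rows = (df.dropDuplicates(["col1", "col2", "col3"])
                   .select("col1", "col2", "col3")
                   .toJSON()
                   .collect())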

Upvotes: 4

Elior Malul

Reputation: 691

I hope I understand your question correctly; you can try the following:

import org.apache.spark.sql.{functions => F}
import spark.implicits._  // brings toDF into scope (already imported in spark-shell)

val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show

Results:

+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
|     [a1, b, a]|      [1, 2, 4]|        [1, 10]|
+---------------+---------------+---------------+

The code above should be more efficient than the proposed column-by-column select distinct, for several reasons:

  1. Fewer round trips between the workers and the driver.
  2. De-duplication is done locally on each worker before the cross-worker de-duplication.
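
As a side note, the same collect_set idea can be written once over all columns instead of listing them by hand; a minimal PySpark sketch, assuming a DataFrame named df:

from pyspark.sql import functions as F

# One collect_set per column, computed in a single pass over the DataFrame
distinct_per_col = df.select([F.collect_set(c).alias(c) for c in df.columns])
distinct_per_col.show()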

Hope it helps!

Upvotes: 6
