Reputation: 3896
I have an RDD and I want to find distinct values for multiple columns.
Example:
Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)
I would like to have a map:
col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]
Can a DataFrame help compute this faster or more simply?
My solution with RDD was:
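def to_uniq_vals(row):
    # Assumes the RDD holds pyspark.sql.Row objects; Row has no items(), so go through asDict()
    return [(k, v) for k, v in row.asDict().items()]

rdd.flatMap(to_uniq_vals).distinct().collect()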
Thanks
Upvotes: 2
Views: 14445
Reputation: 556
You can use dropDuplicates and then select the same columns. It might not be the most efficient approach, but it is a decent one:
df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON
Note: this works well in Scala.
Upvotes: 4
Reputation: 691
I hope I understand your question correctly. You can try the following:
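import org.apache.spark.sql.{functions => F}
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell

// toDF() with no arguments names the columns _1, _2, _3
val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF()
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show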
Results:
+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
|     [a1, b, a]|      [1, 2, 4]|        [1, 10]|
+---------------+---------------+---------------+
The code above should be more efficient than the proposed column-by-column select distinct, not least because it computes every set in a single aggregation instead of running one job per column.
Hope it helps!
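For anyone who prefers to stay with the RDD from the question, a per-column map of distinct values can probably also be built in one pass with aggregate. The sketch below is plain Python so it runs without a cluster. The names add_row, merge and partitions are made up for illustration, and each row is assumed to be a dict (for example the result of Row.asDict()).

```python
from functools import reduce

def add_row(acc, row):
    """Fold one row (a dict of column -> value) into the accumulator."""
    for col, val in row.items():
        acc.setdefault(col, set()).add(val)
    return acc

def merge(left, right):
    """Combine two per-partition accumulators into one."""
    for col, vals in right.items():
        left.setdefault(col, set()).update(vals)
    return left

# Local stand-in for two partitions holding the question's example rows
partitions = [
    [{"col1": "a", "col2": "b", "col3": 1}],
    [{"col1": "b", "col2": 2, "col3": 10},
     {"col1": "a1", "col2": 4, "col3": 10}],
]

# Fold each partition on its own, then merge the partial results
partials = [reduce(add_row, part, {}) for part in partitions]
distinct_by_col = reduce(merge, partials, {})

# With Spark the equivalent call would be roughly (untested assumption):
# distinct_by_col = rdd.map(lambda r: r.asDict()).aggregate({}, add_row, merge)
```

The result is a dict of sets keyed by column name, so only one small object per partition travels back to the driver instead of every distinct (column, value) pair.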
Upvotes: 6