A.HADDAD

Reputation: 1906

Deduplicate Spark Dataframe by Field

Let's suppose that I have the following Spark DataFrame:

 -----------------------
 | geohash | timehash  |
 -----------------------
 | x       | y         |
 -----------------------
 | x       | z         |
 -----------------------
 | z       | y         |
 -----------------------

Is it possible to deduplicate it by the geohash field and collect the values of the second field, like this?

 -----------------------
 | geohash | timehash  |
 -----------------------
 | x       | y , z     |
 -----------------------
 | z       | y         |
 -----------------------

Upvotes: 0

Views: 478

Answers (2)

Manoj Kumar Dhakad

Reputation: 1892

You can use groupBy with an aggregate function to achieve this, as below:

import org.apache.spark.sql.functions.collect_list

df.groupBy("geohash").agg(collect_list("timehash").alias("timehash")).show

//output
+-------+--------+
|geohash|timehash|
+-------+--------+
|      x|  [y, z]|
|      z|     [y]|
+-------+--------+
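If you want the comma-separated string shown in the question rather than an array, the same grouping idea can be sketched with plain Scala collections (no Spark needed; the data below just mirrors the question's rows):

```scala
// Plain-Scala sketch of the groupBy + collect_list idea.
val rows = Seq(("x", "y"), ("x", "z"), ("z", "y"))

val collected: Map[String, String] = rows
  .groupBy { case (geohash, _) => geohash }     // group rows by geohash
  .map { case (geohash, pairs) =>
    // keep only the timehash values and join them as in the question
    geohash -> pairs.map { case (_, timehash) => timehash }.mkString(" , ")
  }

// collected("x") == "y , z", collected("z") == "y"
```

In Spark itself the equivalent string output is typically produced by wrapping collect_list in concat_ws.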

Upvotes: 2

Avishek Bhattacharya

Reputation: 6974

You can get the desired result with aggregateByKey or reduceByKey. I haven't tested my code with the exact data you have provided, but the basic code should look like:

val geoHashRdd = geoHashDF.rdd.map(row => (row.getAs[String]("geohash"), row.getAs[String]("timehash")))
val reduced = geoHashRdd.reduceByKey((a, b) => a.concat(b))

OR

val aggregated = geoHashRdd.aggregateByKey("")((aggr, value) => aggr + String.valueOf(value), (aggr1, aggr2) => aggr1 + aggr2)
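What reduceByKey does per key can be illustrated with plain Scala collections (a sketch, no Spark required; `pairs` stands in for the RDD's key/value tuples):

```scala
// Sketch of reduceByKey((a, b) => a.concat(b)) semantics per key.
val pairs = Seq(("x", "y"), ("x", "z"), ("z", "y"))

val reduced: Map[String, String] = pairs
  .groupBy(_._1)                                        // partition pairs by key (geohash)
  .map { case (key, kvs) =>
    key -> kvs.map(_._2).reduce((a, b) => a.concat(b))  // fold the values with string concat
  }

// reduced("x") == "yz", reduced("z") == "y"
```

aggregateByKey does the same thing but separates the within-partition function from the cross-partition combiner, which is why it takes two function arguments.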

Upvotes: 1
