Reputation: 1906
Let's suppose that I have the following Spark DataFrame:
-----------------------
| geohash | timehash |
-----------------------
| x       | y        |
| x       | z        |
| z       | y        |
-----------------------
Is it possible to deduplicate it by the geohash field and collect the values of the second field, like this?
-----------------------
| geohash | timehash |
-----------------------
| x       | y, z     |
| z       | y        |
-----------------------
Upvotes: 0
Views: 478
Reputation: 1892
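You can use groupBy and an aggregate function to achieve this, as shown below:
import org.apache.spark.sql.functions.collect_list
// alias the aggregated column (not the resulting DataFrame) so that it is named "timehash"
df.groupBy("geohash").agg(collect_list("timehash").alias("timehash")).show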
//output
+-------+--------+
|geohash|timehash|
+-------+--------+
| x| [y, z]|
| z| [y]|
+-------+--------+
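Note that collect_list keeps duplicate values, while collect_set drops them. If a single comma-separated string is wanted instead of an array, as in the question, one option is to wrap the aggregate in concat_ws. The following is a minimal sketch meant for spark-shell or a Scala script, not a tested solution for the asker's real data; the sample rows are copied from the question, the names `df` and `result` are illustrative, and sort_array is added here only to make the order of the joined values deterministic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, concat_ws, sort_array}

// Local session for experimenting; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("dedup-geohash").getOrCreate()
import spark.implicits._

// Sample rows taken from the question.
val df = Seq(("x", "y"), ("x", "z"), ("z", "y")).toDF("geohash", "timehash")

val result = df
  .groupBy("geohash")
  // collect_set removes repeated timehash values within a geohash,
  // sort_array fixes their order, concat_ws joins them into one string.
  .agg(concat_ws(", ", sort_array(collect_set("timehash"))).alias("timehash"))

result.show()
```

The row order printed by show is not guaranteed after a groupBy, so add an orderBy("geohash") if a stable listing matters.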
Upvotes: 2
Reputation: 6974
You can get the desired result with aggregateByKey or reduceByKey. I haven't tested my code with the exact data you provided, but the basic code should look like this:
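// Row has no named fields, so read the columns with getAs
val geoHashRdd = geoHashDF.rdd.map(row => (row.getAs[String]("geohash"), row.getAs[String]("timehash")))
val reduced = geoHashRdd.reduceByKey((a, b) => a + ", " + b)
OR
// Skip the empty zero value so that the result has no leading separator
val join = (a: String, b: String) => if (a.isEmpty) b else if (b.isEmpty) a else a + ", " + b
geoHashRdd.aggregateByKey("")(join, join)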
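The per-key merge can be checked locally before running it on a cluster. Below is a rough sketch that uses plain Scala collections as a stand-in for the RDD; `pairs`, `merge` and `reduced` are illustrative names, and the sample data is taken from the question. Unlike this local version, a real RDD does not guarantee the order in which the values of a key are combined.

```scala
// (geohash, timehash) pairs from the question.
val pairs = Seq(("x", "y"), ("x", "z"), ("z", "y"))

// The same kind of function one would pass to reduceByKey:
// join two values of the same key with a separator.
val merge: (String, String) => String = (a, b) => a + ", " + b

// Group by key, then fold each group's values with merge,
// which mimics what reduceByKey does per key.
val reduced: Map[String, String] =
  pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).reduce(merge) }

reduced.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k -> $v") }
```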
Upvotes: 1