Reputation: 1353
I have 2 datasets(tables) with 35kk+ rows.
I try to join(or group by) this datasets by some id. (in common it will be one-to-one)
But this operation takes a lot time: 25+ h.
Filters only works fine: ~20 mins.
Env: emr-5.3.1
Hadoop distribution:Amazon
Applications:Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2
Instance type: m3.xlarge
Code (groupBy):
Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")
Code (join):
...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))
Also I found this message in EMR logs:
HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
Upvotes: 0
Views: 3634
Reputation: 1023
There Might be a possibility of Data getting Skewed. We faced this. Check your joining column. This happens mostly if your joining column has NULLS
Check Data Stored pattern with :
select joining_column, count(joining_col) from <tablename>
group by joining_col
This will give you an idea that whether the data in your joining column is Evenly distributed
Upvotes: 1