How to perform group and count using map and reduce function in Spark 2.3.1

Question

I am a new bee to spark and I am trying to perform a group by and count using the following spark functions:

 Dataset result =  dataset
       .groupBy("column1", "column2")
       .count();

But I read here that using group by is not a good idea since it does not have a combiner, which in turn affects the spark job's runtime efficiency. Instead, one should use reduceByKey function for aggregation operations.

So I tried using reduceByKey function, but it is not available for dataset. Instead, datasets use reduce(ReduceFunction func).

Since I can not find an example to perform group and count with reduce function, I tried converting it to JavaRDD and used reduceByKey:

//map each row to 1 and then group them by key 
JavaPairRDD mapOnes;
            try {
                 mapOnes = dailySummary.javaRDD().mapToPair(
                        new PairFunction() {
                            @Override
                            public Tuple2 call(Row t) throws Exception {
                                return new Tuple2(new String[]{t.getAs("column1"), t.getAs("column2")}, 1);
                            }   
                });
            }catch(Exception e) {
                log.error("exception in mapping ones: "+e);
                throw new Exception();
            }


        JavaPairRDD rowCount;
        try {
            rowCount = mapOnes.reduceByKey(
                new Function2() {

                    @Override
                    public Integer call(Integer v1, Integer v2) throws Exception {
                        return v1+v2;
                    }
                });
        }catch(Exception e) {
            log.error("exception in reduce by key: "+e);
            throw new Exception();
        }

But this is also giving exception as org.apache.spark.SparkException: Task not serializable for mapToPair function.

Can anyone suggest a better way to group and perform count using dataset's reduce and map function.

Any help is appreciated.

Assaf Mendelson · Accepted Answer

The groupBy in the link you added refers to RDD. In the RDD semantics, groupBy would basically shuffle all the data according to the key, i.e. it would bring ALL values relating to the key to one place.

This is why reduceByKey is suggested as reduceByKey first performs the reduce operation on each partition and only the reduced value is shuffled which means a lot less traffic (and prevents out of memory issues with bringing everything to one partition).

In Datasets, groupBy behaves differently. It does not give a dataset as a returned object but instead a KeyValueGroupedDataset object. When you do count on this object (or the more generic agg), it basically defines a reducer which works very similar to reduceByKey.

This means there is no need for a separate reduceByKey method (the dataset groupby is actually a form of reduceByKey).

Stick with the original groupBy(...).count(...)

How to perform group and count using map and reduce function in Spark 2.3.1

Answers (2)

Related Questions