Oxford

Reputation: 35

JavaRDD equivalent to GROUP BY

I have a CSV dataset with the columns (Accident_Id, Date, Area) and hundreds of rows. What I want to achieve is to group the rows by the Area column into the unique areas and find the count of each.

I know how to do this with SQLContext, but I am not sure how it's achievable with JavaRDD and its actions (map, reduce, etc.):

SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile(pathToCSV);

...
sqlContext.sql("SELECT COUNT(Area) FROM my_table GROUP BY Area").show();

Upvotes: 1

Views: 1114

Answers (1)

ernest_k

Reputation: 45319

You can simply build a pair RDD and count by key.

The following just assumes a String RDD with comma-separated records:

// Area is the third CSV column, hence index 2
Map<String, Long> areaCounts = 
    data.mapToPair(s -> new scala.Tuple2<>(s.split(",")[2], 1L)).countByKey();

That will give you the area -> count map.

If you prefer to implement your reduction logic by hand, you can use reduceByKey:

// Same result as countByKey, but with the merge logic spelled out
// (the lambda is equivalent to Long::sum)
Map<String, Long> areaCounts = 
    data.mapToPair(s -> new scala.Tuple2<>(s.split(",")[2], 1L))
            .reduceByKey((l1, l2) -> l1 + l2).collectAsMap();
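If it helps to see the aggregation without the Spark API in the way, here is a plain-Java sketch of the same key-and-count logic over an in-memory list. The class name and the sample rows are made up for illustration; the column layout matches the question (Accident_Id, Date, Area):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AreaCountSketch {

    // Key each row on its third column and count occurrences,
    // mirroring mapToPair(...).countByKey() in the Spark snippet.
    static Map<String, Long> countByArea(List<String> rows) {
        return rows.stream()
            .collect(Collectors.groupingBy(
                s -> s.split(",")[2], Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample rows: Accident_Id, Date, Area
        List<String> rows = List.of(
            "1,2018-01-01,North",
            "2,2018-01-02,South",
            "3,2018-01-03,North");
        System.out.println(countByArea(rows).get("North")); // prints 2
    }
}
```

The only difference in the distributed version is that Spark partitions the rows and merges the per-partition counts for you.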

Upvotes: 1
