Reputation: 35
I have a CSV dataset with the following columns (Accident_Id, Date, Area) and hundreds of rows. What I want to achieve is to group the rows by the unique values of the Area column and get the count of each group.
I know how to do this with SQLContext, but I am not sure how it is achievable with JavaRDD and its actions (map, reduce, etc.):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile(pathToCSV);
...
sqlContext.sql("SELECT COUNT(Area) FROM my_table GROUP BY Area").show();
Upvotes: 1
Views: 1114
Reputation: 45319
You can simply make a pair RDD and use it to count by its keys.
The following just assumes a String RDD of comma-separated records, with Area as the third column (index 2):
// Map each line to an (Area, 1) pair, then count the occurrences of each key
Map<String, Long> areaCounts =
        data.mapToPair(s -> new scala.Tuple2<>(s.split(",")[2], 1L))
            .countByKey();
And that will give you the area -> count map.
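For reference, here is a minimal self-contained sketch of that approach. The class name, file path, and header-skipping step are my own assumptions (the question does not say whether the CSV has a header row; without the filter, the literal string "Area" would show up as a key):

import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class AreaCounts {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path; columns are (Accident_Id, Date, Area)
        JavaRDD<String> data = sc.textFile("accidents.csv");

        // Drop the header row, if there is one, so "Area" itself is not counted
        String header = data.first();
        JavaRDD<String> rows = data.filter(line -> !line.equals(header));

        // Area is the third column (index 2)
        Map<String, Long> areaCounts = rows
                .mapToPair(s -> new Tuple2<>(s.split(",")[2], 1L))
                .countByKey();

        areaCounts.forEach((area, count) -> System.out.println(area + " -> " + count));

        sc.stop();
    }
}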
If you prefer to implement your reduction logic by hand, you can use reduceByKey:
Map<String, Long> areaCounts =
        data.mapToPair(s -> new scala.Tuple2<>(s.split(",")[2], 1L))
            .reduceByKey((l1, l2) -> l1 + l2)
            .collectAsMap();
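Note that countByKey is an action that brings the whole map back to the driver, whereas reduceByKey is a transformation, so you can also keep the counts as a distributed pair RDD and write them out instead of collecting them. A quick sketch continuing from the data RDD above (the output directory name is made up; JavaPairRDD comes from org.apache.spark.api.java):

// Keep the counts distributed instead of collecting them to the driver
JavaPairRDD<String, Long> counts =
        data.mapToPair(s -> new scala.Tuple2<>(s.split(",")[2], 1L))
            .reduceByKey((l1, l2) -> l1 + l2);

// Sort by area name and save one "(area,count)" line per record
counts.sortByKey().saveAsTextFile("area-counts");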
Upvotes: 1