Reshma Raveendran

Reputation: 89

Group spark Dataset by month from a timestamp in java

I've loaded all rows from a table into Dataset using spark session in java. I want to get the count of rows in each month.

I tried to create new column of month by using withColumn() so that I can later use group_by month and count(). But I am not able to get month from timestamp. How can I find the count in each month from above dataset?

My sample Dataset looks like this (screenshot not reproduced here).

Upvotes: 0

Views: 1316

Answers (2)

ShaksM

Reputation: 84

Considering the way you explained your problem: "I tried to create a new column of month by using withColumn() so that I can later use groupBy() on month and count(). But I am not able to get the month from the timestamp."

You can use the static month() function provided in the org.apache.spark.sql.functions package to find the month, as below:

myDataset.withColumn("month", month(col("date"))).groupBy(col("month")).count().show()

where col("date") holds the timestamp (in the example below: "yyyy-MM-dd HH:mm:ss" - note that MM is months; mm would be minutes).

Input used:

1,2019-04-07  07:24:14,0,8

2,2019-05-07  07:24:14,0,10

5,2019-06-07  07:24:14,0,6

3,2019-04-07  07:24:14,0,7

This will give you an output as below:

+-----+-----+
|month|count|
+-----+-----+
|    6|    1|
|    5|    1|
|    4|    2|
+-----+-----+

Hope this helps!
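The month extraction itself can also be sketched outside Spark with plain java.time, assuming timestamps in the "yyyy-MM-dd HH:mm:ss" layout shown above (this is an illustration of the pattern syntax, not the Spark API):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MonthOfTimestamp {
    // Formatter matching the sample timestamps above
    // (MM = month-of-year; mm would mean minute-of-hour)
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns the month number (1-12) of a timestamp string
    static int monthOf(String ts) {
        return LocalDateTime.parse(ts, FMT).getMonthValue();
    }

    public static void main(String[] args) {
        System.out.println(monthOf("2019-04-07 07:24:14")); // prints 4
    }
}
```

Spark's month() applies the same kind of extraction to every row of the column at once.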

Upvotes: 3

I believe you can use the Tuple2<> type:

// mapToPair (not map) is needed so reduceByKey is available on the pair RDD
Map<Date, Integer> counts = myDataSetRDD
            .mapToPair(x -> new Tuple2<>(x.getDate(), 1))
            .reduceByKey((x, v) -> x + v)
            .collectAsMap();

This way you end up with a map that has dates as keys and the count of each date as the value. I hope that helps.
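The same count-by-key idea can be illustrated without Spark using plain Java streams (a sketch with made-up date strings, not the RDD API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountByDate {
    // Equivalent of map(x -> (date, 1)).reduceByKey(+): one count per distinct key
    static Map<String, Long> countByDate(List<String> dates) {
        return dates.stream()
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        // Stand-in for the dates pulled out of each row
        List<String> dates = Arrays.asList(
                "2019-04-07", "2019-05-07", "2019-06-07", "2019-04-07");
        // "2019-04-07" maps to 2, the other dates to 1
        System.out.println(countByDate(dates));
    }
}
```

Spark's reduceByKey does the same aggregation, but distributed across partitions rather than on a single in-memory list.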

Upvotes: 1
