Reputation: 89
I've loaded all rows from a table into Dataset using spark session in java. I want to get the count of rows in each month.
I tried to create new column of month by using withColumn() so that I can later use group_by month and count(). But I am not able to get month from timestamp. How can I find the count in each month from above dataset?
My sample Dataset will look like this,
Upvotes: 0
Views: 1316
Reputation: 84
Considering the way you have explained your problem:
I tried to create new column of month by using withColumn()
so that I can later use groupBy()
month and count()
. But I am not able to get month from timestamp.
You can you the static month()
function provided in org.apache.spark.sql.functions
package to find the month, as below:
myDataset.withColumn("month", month(col("date"))).groupBy(col("month")).count().show()
where col("date")
will have the timestamp (in below case : "yyyy-mm-dd HH:mm:ss"
).
Input used:
1,2019-04-07 07:24:14,0,8
2,2019-05-07 07:24:14,0,10
5,2019-06-07 07:24:14,0,6
3,2019-04-07 07:24:14,0,7
This will give you an output as below:
+-----+-----+
|month|count|
+-----+-----+
| 6| 1|
| 5| 1|
| 4| 2|
+-----+-----+
Hope this helps!!
Upvotes: 3
Reputation: 59
I believe you can use Tuple2<> type
Map<Date, Integer> = myDataSetRDD.map(x -> new Tuple2<Date, Integer>(x.getDate(), 1))
.reduceByKey((x, v) -> x + v)
.collectAsMap();
This way you end up with a map, that has dates as keys and count of those dates as values. I hope that helps
Upvotes: 1