Vijay_Shinde

Reputation: 1352

How to group by an epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds. Here is the sample data.

date,   Col1
1506838074000,  a
1506868446000,  b
1506868534000,  c
1506869064000,  a
1506869211000,  c
1506871846000,  f
1506874462000,  g
1506879651000,  a

Here is what I'm trying to achieve.

**date  Count of records**
02-10-2017  4
04-10-2017  3
03-10-2017  5

Here is the code I tried for the group by:

import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")

But while executing the code I get the exception below.

   <console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
         val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")

Please help me to resolve the issue.

Upvotes: 0

Views: 326

Answers (1)

Ramesh Maharjan

Reputation: 41957

You would need to convert the date column, which is a long, to a date type. This can be done with the from_unixtime built-in function; after that it is just groupBy and agg calls using the count function.

import org.apache.spark.sql.functions._

// format the epoch-millisecond column as a dd-MM-yyyy string
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))

df.withColumn("date", stringDate($"date"))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)

The answer above uses a udf, which should be avoided whenever possible: a udf is a black box to the optimizer and requires serialization and deserialization of the column values.

Updated

Thanks to @philantrovert for his suggestion to divide by 1000

import org.apache.spark.sql.functions._

// from_unixtime expects seconds, so divide the millisecond column by 1000
df.withColumn("date", from_unixtime($"date" / 1000, "yyyy-MM-dd"))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)

Both ways work.
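For reference, the conversion that from_unixtime performs can be sketched in plain Scala with java.time, no Spark needed. Note that `epochMillisToDate` is a hypothetical helper name, and UTC is assumed here, whereas Spark's from_unixtime uses the session time zone:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Divide millis by 1000 (conceptually) by treating the value as an Instant,
// then format it as a calendar date in UTC.
def epochMillisToDate(millis: Long): String =
  DateTimeFormatter.ofPattern("yyyy-MM-dd")
    .withZone(ZoneOffset.UTC)
    .format(Instant.ofEpochMilli(millis))

// First sample row: 1506838074000 falls on 2017-10-01 in UTC.
println(epochMillisToDate(1506838074000L))
```

Grouping then just counts rows that share the same formatted string, which is what the groupBy/count in the answer does.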

Upvotes: 1
