Aggregation with arbitrary granularity in Spark

I'm trying to do some aggregation on a DataFrame with this simple structure:

+--------------------+---------+---------+
|                date|    data1|    data2|
+--------------------+---------+---------+
|2016-01-30 00:10:...|   0.0596|   0.5385|
|2016-01-30 00:20:...|  0.05578|   0.5385|
|2016-01-30 00:30:...|  0.03375|   0.5055|
|2016-01-30 00:40:...|   0.0379|  0.51266|
|2016-01-30 00:50:...|  0.05471|   0.5345|
|2016-01-30 01:00:...|   0.0926|  0.54833|
......

I need to aggregate (SUM) data1 and data2, grouped by date, with an arbitrary granularity (e.g. daily, hourly, weekly, or every 12 hours).

So far I've only managed to do it daily:

SELECT TO_DATE(date) as date, SUM(data1) as data1, SUM(data2) as data2
FROM my_table
GROUP BY TO_DATE(date)

This works fine, but I have no idea how I can achieve the other granularities.

Upvotes: 0

Views: 342

Answers (1)

David Griffin

Reputation: 13927

You can try something like:

SELECT concat(to_date(date), ':', hour(date)) as date_hour
  ...
GROUP BY date_hour

Or even

SELECT to_date(date), hour(date)
  ...
GROUP BY to_date(date), hour(date)
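
Not mentioned in the original answer, but if you're on Spark 2.0 or later, the built-in window() function buckets a timestamp column into fixed-size intervals of arbitrary length (e.g. '1 hour', '12 hours', '1 day', '1 week'). A minimal sketch, assuming date is a timestamp column:

-- Sketch assuming Spark 2.0+, where window(timeColumn, duration) is available.
-- Each row is assigned to a fixed-size time bucket, then summed per bucket.
SELECT window(date, '12 hours') AS time_window,
       SUM(data1) AS data1,
       SUM(data2) AS data2
FROM my_table
GROUP BY window(date, '12 hours')

The time_window column comes back as a struct with start and end fields, so you can pull out the bucket boundary with time_window.start in an outer query if you need a plain timestamp.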

Upvotes: 1
