user7183867

Calculate the sum over a 24-hour time frame in a Spark DataFrame

I want to calculate the sum over a 24-hour window that spans date and date+1, bucketing the rows based on their hour of day.

1, 2018-05-01 02:12:00,1
1, 2018-05-01 03:16:10,2
1, 2018-05-01 09:12:00,4
1, 2018-05-01 14:18:00,3
1, 2018-05-01 18:32:00,1
1, 2018-05-01 20:12:00,1
1, 2018-05-02 01:22:00,1
1, 2018-05-02 02:12:00,1
1, 2018-05-02 08:30:00,1
1, 2018-05-02 10:12:00,1
1, 2018-05-02 11:32:00,1
1, 2018-05-02 18:12:00,1
1, 2018-05-03 03:12:00,1
1, 2018-05-03 08:22:00,1

Here, for example, I have grouped the rows into windows running from 9 AM to 9 AM of the next day; the 2018-05-01 window covers 2018-05-01 09:00:00 up to 2018-05-02 09:00:00, giving 4+3+1+1+1+1+1 = 12. Expected output:

1, 2018-05-01,12
1, 2018-05-02,5

Upvotes: 0

Views: 976

Answers (3)

jxc

Reputation: 13998

Just shift your timestamp column back by 9 hours, then group by the date of the adjusted column:

from pyspark.sql.functions import expr, sum as fsum 

df
# DataFrame[id: int, dtime: timestamp, cnt: int]

df.groupby("id", expr("date(dtime - interval 9 hours) as ddate")) \
  .agg(fsum("cnt").alias("cnt")) \
  .show()
+---+----------+---+
| id|     ddate|cnt|
+---+----------+---+
|  1|2018-05-01| 12|
|  1|2018-05-02|  5|
|  1|2018-04-30|  3|
+---+----------+---+
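
For reference, a minimal sketch of how the sample df above can be built (this setup is an assumption for reproducibility; it presumes an existing SparkSession named spark):

from pyspark.sql.functions import to_timestamp

rows = [(1, "2018-05-01 02:12:00", 1), (1, "2018-05-01 03:16:10", 2),
        (1, "2018-05-01 09:12:00", 4), (1, "2018-05-01 14:18:00", 3),
        (1, "2018-05-01 18:32:00", 1), (1, "2018-05-01 20:12:00", 1),
        (1, "2018-05-02 01:22:00", 1), (1, "2018-05-02 02:12:00", 1),
        (1, "2018-05-02 08:30:00", 1), (1, "2018-05-02 10:12:00", 1),
        (1, "2018-05-02 11:32:00", 1), (1, "2018-05-02 18:12:00", 1),
        (1, "2018-05-03 03:12:00", 1), (1, "2018-05-03 08:22:00", 1)]

# cast the string column to a real timestamp so the interval arithmetic above works
df = spark.createDataFrame(rows, ["id", "dtime", "cnt"]) \
          .withColumn("dtime", to_timestamp("dtime"))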

Upvotes: 0

notNull

Reputation: 31520

Use the built-in Spark functions date_format(), date_add() and to_date(), then groupBy and aggregate.

Example:

Spark-Scala:

df.show()
//+---+-------------------+---+
//| id|               date|cnt|
//+---+-------------------+---+
//|  1|2018-05-01 02:12:00|  1|
//|  1|2018-05-01 03:16:10|  2|
//|  1|2018-05-01 09:12:00|  4|
//|  1|2018-05-01 14:18:00|  3|
//|  1|2018-05-01 18:32:00|  1|
//|  1|2018-05-01 20:12:00|  1|
//|  1|2018-05-02 01:22:00|  1|
//|  1|2018-05-02 02:12:00|  1|
//|  1|2018-05-02 08:30:00|  1|
//|  1|2018-05-02 10:12:00|  1|
//|  1|2018-05-02 11:32:00|  1|
//|  1|2018-05-02 18:12:00|  1|
//|  1|2018-05-03 03:12:00|  1|
//|  1|2018-05-03 08:22:00|  1|
//+---+-------------------+---+

df.withColumn("hour",when(date_format(col("date"),"HH").cast("int") >= 9,to_date(col("date"))).otherwise(date_add(to_date(col("date")),-1))).
groupBy("id","hour").
agg(sum("cnt").cast("int").alias("sum")).
show()
//+---+----------+---+
//| id|      hour|sum|
//+---+----------+---+
//|  1|2018-05-01| 12|
//|  1|2018-05-02|  5|
//|  1|2018-04-30|  3|
//+---+----------+---+

Pyspark:

from pyspark.sql.functions import *

# rows before 9 AM are assigned to the previous day's window
df.withColumn("hour", when(date_format(col("date"), "HH").cast("int") >= 9, to_date(col("date")))
      .otherwise(date_add(to_date(col("date")), -1))) \
  .groupBy("id", "hour") \
  .agg(sum("cnt").cast("int").alias("sum")) \
  .show()
#+---+----------+---+
#| id|      hour|sum|
#+---+----------+---+
#|  1|2018-05-01| 12|
#|  1|2018-05-02|  5|
#|  1|2018-04-30|  3|
#+---+----------+---+
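
As a small aside (a sketch, not part of the original answer): the otherwise branch can equivalently use date_sub instead of date_add with a negative offset, and the bucketing expression can be factored out; window_date and wdate are illustrative names:

from pyspark.sql.functions import col, date_format, date_sub, sum as fsum, to_date, when

# rows at or after 9 AM keep their own date; earlier rows fall into the previous day's window
window_date = when(date_format(col("date"), "HH").cast("int") >= 9, to_date(col("date"))) \
    .otherwise(date_sub(to_date(col("date")), 1))

df.withColumn("wdate", window_date) \
  .groupBy("id", "wdate") \
  .agg(fsum("cnt").cast("int").alias("sum")) \
  .show()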

Upvotes: 0

ipj

Reputation: 3598

First define df for reproducibility:

import pandas as pd
import io
data = """
1, 2018-05-01 02:12:00,1
1, 2018-05-01 03:16:10,2
1, 2018-05-01 09:12:00,4
1, 2018-05-01 14:18:00,3
1, 2018-05-01 18:32:00,1
1, 2018-05-01 20:12:00,1
1, 2018-05-02 01:22:00,1
1, 2018-05-02 02:12:00,1
1, 2018-05-02 08:30:00,1
1, 2018-05-02 10:12:00,1
1, 2018-05-02 11:32:00,1
1, 2018-05-02 18:12:00,1
1, 2018-05-03 03:12:00,1
1, 2018-05-03 08:22:00,1
"""

df = pd.read_csv(io.StringIO(data), sep=',', names=['id', 't', 'n'], parse_dates=['t'])

Then use pd.Grouper with the frequency set to 24h and the base parameter set to 9, which indicates that each period begins at 9 a.m.:

df.groupby(pd.Grouper(key='t', freq='24h', base=9)).n.sum()

result:

t
2018-04-30 09:00:00     3
2018-05-01 09:00:00    12
2018-05-02 09:00:00     5
Freq: 24H, Name: n, dtype: int64
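
Note that in newer pandas releases (1.1 and later) the base parameter of pd.Grouper is deprecated in favor of offset, so on a recent pandas the equivalent call would be:

# 'offset' replaces the deprecated 'base' argument (pandas >= 1.1)
df.groupby(pd.Grouper(key='t', freq='24h', offset='9h')).n.sum()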

Upvotes: 1
