I want to calculate a per-date sum where each date covers 24 hours running from one day into the next (date and date+1), splitting the rows based on the hour of day.
1, 2018-05-01 02:12:00,1
1, 2018-05-01 03:16:10,2
1, 2018-05-01 09:12:00,4
1, 2018-05-01 14:18:00,3
1, 2018-05-01 18:32:00,1
1, 2018-05-01 20:12:00,1
1, 2018-05-02 01:22:00,1
1, 2018-05-02 02:12:00,1
1, 2018-05-02 08:30:00,1
1, 2018-05-02 10:12:00,1
1, 2018-05-02 11:32:00,1
1, 2018-05-02 18:12:00,1
1, 2018-05-03 03:12:00,1
1, 2018-05-03 08:22:00,1
Here, for example, I have grouped the rows from 9 AM to 9 AM on the next date. Expected output:
1, 2018-05-01,12
1, 2018-05-02,5
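For reproducibility, here is a minimal sketch that builds this sample data as a Spark DataFrame (the column names id/dtime/cnt are an assumption; the answers below refer to the timestamp column as either dtime or date):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Sample rows copied from the question; cnt is the value to be summed.
rows = [
    (1, "2018-05-01 02:12:00", 1), (1, "2018-05-01 03:16:10", 2),
    (1, "2018-05-01 09:12:00", 4), (1, "2018-05-01 14:18:00", 3),
    (1, "2018-05-01 18:32:00", 1), (1, "2018-05-01 20:12:00", 1),
    (1, "2018-05-02 01:22:00", 1), (1, "2018-05-02 02:12:00", 1),
    (1, "2018-05-02 08:30:00", 1), (1, "2018-05-02 10:12:00", 1),
    (1, "2018-05-02 11:32:00", 1), (1, "2018-05-02 18:12:00", 1),
    (1, "2018-05-03 03:12:00", 1), (1, "2018-05-03 08:22:00", 1),
]
df = spark.createDataFrame(rows, ["id", "dtime", "cnt"]) \
          .withColumn("dtime", col("dtime").cast("timestamp"))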
Upvotes: 0
Views: 976
Reputation: 13998
Just shift your timestamp column back by 9 hours and then group by the date of the adjusted column:
from pyspark.sql.functions import expr, sum as fsum
df
# DataFrame[id: int, dtime: timestamp, cnt: int]
df.groupby("id", expr("date(dtime - interval 9 hours) as ddate")) \
.agg(fsum("cnt").alias("cnt")) \
.show()
+---+----------+---+
| id| ddate|cnt|
+---+----------+---+
| 1|2018-05-01| 12|
| 1|2018-05-02| 5|
| 1|2018-04-30| 3|
+---+----------+---+
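For intuition: subtracting 9 hours pushes every timestamp earlier than 9 AM back onto the previous calendar date, which is why the two early-morning rows of 2018-05-01 (1 + 2 = 3) land in the extra 2018-04-30 group above. A quick sanity check of that shift, assuming an active SparkSession named spark (a sketch, not part of the original answer):
# The 08:30 row should map to 2018-05-01, i.e. the window that opened at 9 AM the previous day.
spark.sql("SELECT date(cast('2018-05-02 08:30:00' as timestamp) - interval 9 hours) AS ddate").show()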
Upvotes: 0
Reputation: 31520
Use the Spark built-in functions date_format(), date_add(), and to_date(), then groupBy and aggregate.
Example:
Spark-Scala:
df.show()
//+---+-------------------+---+
//| id| date|cnt|
//+---+-------------------+---+
//| 1|2018-05-01 02:12:00| 1|
//| 1|2018-05-01 03:16:10| 2|
//| 1|2018-05-01 09:12:00| 4|
//| 1|2018-05-01 14:18:00| 3|
//| 1|2018-05-01 18:32:00| 1|
//| 1|2018-05-01 20:12:00| 1|
//| 1|2018-05-02 01:22:00| 1|
//| 1|2018-05-02 02:12:00| 1|
//| 1|2018-05-02 08:30:00| 1|
//| 1|2018-05-02 10:12:00| 1|
//| 1|2018-05-02 11:32:00| 1|
//| 1|2018-05-02 18:12:00| 1|
//| 1|2018-05-03 03:12:00| 1|
//| 1|2018-05-03 08:22:00| 1|
//+---+-------------------+---+
df.withColumn("hour",when(date_format(col("date"),"HH").cast("int") >= 9,to_date(col("date"))).otherwise(date_add(to_date(col("date")),-1))).
groupBy("id","hour").
agg(sum("cnt").cast("int").alias("sum")).
show()
//+---+----------+---+
//| id| hour|sum|
//+---+----------+---+
//| 1|2018-05-01| 12|
//| 1|2018-05-02| 5|
//| 1|2018-04-30| 3|
//+---+----------+---+
PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("hour",when(date_format(col("date"),"HH").cast("int") >= 9,to_date(col("date"))).otherwise(date_add(to_date(col("date")),-1))).\
groupBy("id","hour").\
agg(sum("cnt").cast("int").alias("sum")).\
show()
#+---+----------+---+
#| id| hour|sum|
#+---+----------+---+
#| 1|2018-05-01| 12|
#| 1|2018-05-02| 5|
#| 1|2018-04-30| 3|
#+---+----------+---+
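A slightly more direct way to write the same condition in PySpark, if you prefer named imports over import * (a sketch using hour() instead of date_format(...).cast("int"); since the resulting column holds a date, it is called ddate here rather than hour):
from pyspark.sql.functions import col, hour, when, to_date, date_add, sum as fsum

# Rows at or after 9 AM keep their own date; earlier rows are attributed to the previous date.
df.withColumn("ddate", when(hour(col("date")) >= 9, to_date(col("date")))
                       .otherwise(date_add(to_date(col("date")), -1))) \
  .groupBy("id", "ddate") \
  .agg(fsum("cnt").cast("int").alias("sum")) \
  .show()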
Upvotes: 0
Reputation: 3598
First define df for reproducibility:
import pandas as pd
import io
data=\
"""
1, 2018-05-01 02:12:00,1
1, 2018-05-01 03:16:10,2
1, 2018-05-01 09:12:00,4
1, 2018-05-01 14:18:00,3
1, 2018-05-01 18:32:00,1
1, 2018-05-01 20:12:00,1
1, 2018-05-02 01:22:00,1
1, 2018-05-02 02:12:00,1
1, 2018-05-02 08:30:00,1
1, 2018-05-02 10:12:00,1
1, 2018-05-02 11:32:00,1
1, 2018-05-02 18:12:00,1
1, 2018-05-03 03:12:00,1
1, 2018-05-03 08:22:00,1
"""
df = pd.read_csv(io.StringIO(data), sep=',', names=['id', 't', 'n'], parse_dates=['t'])
Then use pd.Grouper with the frequency set to 24h and the base parameter set to 9, which makes each period begin at 9 a.m.:
df.groupby(pd.Grouper(key='t', freq='24h', base=9)).n.sum()
result:
t
2018-04-30 09:00:00 3
2018-05-01 09:00:00 12
2018-05-02 09:00:00 5
Freq: 24H, Name: n, dtype: int64
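On newer pandas releases the base argument of pd.Grouper is deprecated; the offset argument should produce the same 9 AM windows (a sketch, assuming pandas >= 1.1):
# Equivalent grouping with the replacement parameter: each 24h bucket starts 9 hours after midnight.
df.groupby(pd.Grouper(key='t', freq='24h', offset='9h')).n.sum()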
Upvotes: 1