user7006069

Reputation: 87

Aggregated at week start date (Monday) for the complete week


Using the window function, we cannot set Monday as the start day for a weekly aggregation in Spark. Is there any workaround for this?

from pyspark.sql.functions import window, sum

df = spark.createDataFrame([
  ("001", "event1", 10, "2016-05-01 10:50:51"),
  ("002", "event2", 100, "2016-05-02 10:50:53"),
  ("001", "event3", 20, "2016-05-03 10:50:55"),
  ("010", "event3", 20, "2016-05-05 10:50:55"),
  ("001", "event1", 15, "2016-05-01 10:51:50"),
  ("003", "event1", 13, "2016-05-10 10:55:30"),
  ("001", "event2", 12, "2016-05-11 10:57:00"),
  ("001", "event3", 11, "2016-05-21 11:00:01"),
  ("002", "event2", 100, "2016-05-22 10:50:53"),
  ("001", "event3", 20, "2016-05-28 10:50:55"),
  ("001", "event1", 15, "2016-05-30 10:51:50"),
  ("003", "event1", 13, "2016-06-10 10:55:30"),
  ("001", "event2", 12, "2016-06-12 10:57:00"),
  ("001", "event3", 11, "2016-06-14 11:00:01")]).toDF("KEY", "Event_Type", "metric", "Time")

# Note: summing the numeric "metric" column; "KEY" is a string identifier
df2 = df.groupBy(window("Time", "7 day")) \
    .agg(sum("metric").alias("aggregate_sum")) \
    .select("window.start", "window.end", "aggregate_sum") \
    .orderBy("window")

The expected output is weekly aggregated data with each window starting on a Monday. However, Spark starts the 7-day window aggregation from a fixed reference day rather than Monday.

Upvotes: 2

Views: 827

Answers (1)

RickyG

Reputation: 138

Windows are aligned by default to 1970-01-01 (the Unix epoch), which is a Thursday. You can use

window("Time", "7 day", startTime="4 days")

to shift the alignment to Mondays.
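The offset works because a 4-day shift moves the alignment point from Thursday, 1970-01-01, to Monday, 1970-01-05. A minimal pure-Python sketch of the alignment arithmetic (`week_window_start` is a hypothetical helper mirroring Spark's window formula, not a Spark API):

```python
from datetime import datetime, timedelta

def week_window_start(ts, offset_days=4):
    """Start of the 7-day window containing ts, with windows aligned
    to the Unix epoch plus an offset (Spark's startTime)."""
    epoch = datetime(1970, 1, 1)   # a Thursday
    week = timedelta(days=7)
    offset = timedelta(days=offset_days)
    # floor((ts - epoch - offset) / 7 days) whole weeks past the offset point
    n = (ts - epoch - offset) // week
    return epoch + n * week + offset

start = week_window_start(datetime(2016, 5, 3, 10, 50, 55))
print(start, start.strftime("%A"))  # 2016-05-02 00:00:00 Monday
```

With the question's data, `df.groupBy(window("Time", "7 day", startTime="4 days"))` accordingly produces Monday-aligned windows such as 2016-05-02 to 2016-05-09.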

Upvotes: 4
