Reputation: 321
I have data ordered by timestamp, with the following structure:
+------------+--------+--------+----------+-------+
| timestamp | value | device | subgroup | group |
+------------+--------+--------+----------+-------+
| 1377986440 | 0 | 1 | 0 | 5 |
| 1377986440 | 2.25 | 1 | 0 | 5 |
| 1377986440 | 0 | 2 | 0 | 6 |
| 1377986440 | 0.135 | 0 | 0 | 6 |
| 1377986440 | 0.355 | 0 | 0 | 6 |
+------------+--------+--------+----------+-------+
I have already changed the timestamp from Long to TimestampType and records go from 1 September 2013 to 30 September 2013.
I need to compute the mean and standard deviation for value over the whole dataset into the following time windows: [00:00, 6:00), [06:00,12:00), [12:00, 18:00), [18:00,00:00) and for each group. For instance, the output should be like this:
+-------+--------------+------+-------+
| group | timeSlot | avg | std |
+-------+--------------+------+-------+
| 0 | 00:00 6:00 | 1.4 | 0.25 |
| 0 | 06:00 12:00 | 2.4 | 0.25 |
| ... | ... | .... | .... |
| 3 | 00:00 6:00 | 2.3 | 0.1 |
| 3 | 06:00 12:00 | 0.0 | 0.0 |
| ... | ... | ... | ... |
+-------+--------------+------+-------+
I have tried to use window how explained here so i converted my Unix timestamp in a TimestampType with format HH:mm:ss. Then my code is:
val data = df
.select("*")
.withColumn("timestamp", from_unixtime($"timestamp", "HH:mm:ss"))
val res = data.select("*")
.groupBy($"group", window($"timestamp", "6 hours", "6 hours"))
.agg(avg("value"), stddev("value"))
.orderBy("group")
However the first time slot doesn't start from 00:00 but from 02:00:00 also if I have not specified a starting point for the window. The output I get is:
+--------+------------------------------------------+---------------------+---------------------+
|group |window |avg(cons) |stddev_samp(cons) |
+--------+------------------------------------------+---------------------+---------------------+
|0 |[2018-05-18 02:00:00, 2018-05-18 08:00:00]|1.781448366186445E-4 |0.004346229072242386 |
|0 |[2018-05-18 14:00:00, 2018-05-18 20:00:00]|0.0045980360360061865|0.7113464184007158 |
|0 |[2018-05-18 20:00:00, 2018-05-19 02:00:00]|2.7686190915763437E-4|6.490469208721791E-4 |
|0 |[2018-05-17 20:00:00, 2018-05-18 02:00:00]|0.0016399597206953798|0.12325297254169619 |
|0 |[2018-05-18 08:00:00, 2018-05-18 14:00:00]|2.3354306613988956E-4|5.121337883543223E-4 |
|1 |[2018-05-18 20:00:00, 2018-05-19 02:00:00]|8.319111249637333E-4 |0.00163300686441327 |
|1 |[2018-05-18 14:00:00, 2018-05-18 20:00:00]|0.006463708881068344 |0.7907138759032012 |
|1 |[2018-05-18 02:00:00, 2018-05-18 08:00:00]|6.540241054052753E-4 |0.020490123866864617 |
How should I change my code? I have tried other solutions but none works
Upvotes: 4
Views: 3312
Reputation: 35219
You probably misconfigured timezone settings. Two hour shift suggests that you use GMT+2 or equivalent.
If you use Spark 2.3 (or later) you can just set timezone in your code (or configuration):
spark.conf.set("spark.sql.session.timeZone", "UTC")
Upvotes: 3