rp372
rp372

Reputation: 345

Spark SQL window function/lag() giving unexpected resutls

EDIT: After some more troubleshooting, I found the following - I added the lag(event_time) to the query to see what dates the query was collecting and am getting equally strange results:

SELECT device_id,
       event_time,
       unix_time,
       event_id,
       lag(event_time) OVER (PARTITION BY device_id ORDER BY unix_time,event_id) AS lag_time
FROM ios_d_events;

For some device_ids this returns as expected, however some device_ids return the following:

              device_id               |     event_time      |    unix_time    |               event_id               |          lag_time
--------------------------------------+---------------------+-----------------+--------------------------------------+----------------------------
 111111111111111111111111111111111111 | 2015-03-01 10:41:47 | 1425206507.0000 | 4C48BE67-31EB-4432-96EF-0F30B4191340 |
 111111111111111111111111111111111111 | 2015-03-01 10:41:48 | 1425206508.0000 | A4AE33A2-6CDC-480C-A8D7-C4024810D236 | 2015-03-01 10:41:47
 111111111111111111111111111111111111 | 2015-03-01 10:41:51 | 1425206511.0000 | 997614AE-CE7E-45F6-BCE3-93E70CD46609 | 2015-03-01 10:41:48
 111111111111111111111111111111111111 | 2015-03-01 10:41:53 | 1425206513.0000 | 202DA38C-1823-4100-85AB-3DE139FB3CE3 | 2015-03-01 10:41:51
 111111111111111111111111111111111111 | 2015-03-01 10:42:11 | 1425206531.0000 | 8DFA7938-2123-4978-A89D-6C92404B504D | 2015-03-01 10:41:53
 111111111111111111111111111111111111 | 2015-03-01 10:42:12 | 1425206532.0000 | 463E9833-9526-4E76-A907-4651C13991A0 | 2015-03-01 10:42:11
 111111111111111111111111111111111111 | 2015-03-01 10:42:14 | 1425206534.0000 | 8204696E-3DAA-423E-9031-EC80BFA1157E | 2015-03-01 10:42:12
 111111111111111111111111111111111111 | 2015-03-01 10:42:20 | 1425206540.0000 | 10DB02E2-2D4F-4611-98D7-966074FBD398 | 2015-03-01 10:42:14
 111111111111111111111111111111111111 | 2015-03-01 10:42:20 | 1425206540.0000 | 50535667-A02D-47F7-8320-86EAC4638964 | 2015-03-01 10:42:20
 111111111111111111111111111111111111 | 2015-03-01 10:42:27 | 1425206547.0000 | 6C8BC79D-CB3E-4FE2-8EDD-B8421EA237FF | 2015-03-01 10:42:20
 111111111111111111111111111111111111 | 2015-03-01 10:42:27 | 1425206547.0000 | A732E59D-2EEE-44AE-BB5B-D016E6E6C597 | 2015-03-01 10:42:27
 111111111111111111111111111111111111 | 2015-03-01 10:42:27 | 1425206547.0000 | ABBE4184-65C6-41C1-AC0B-74828DDD8DE8 | 2015-03-01 10:42:27
 111111111111111111111111111111111111 | 2015-03-01 10:42:40 | 1425206560.0000 | 03D0B5FF-9E4D-4F14-8169-7D1C93C617B3 | 2015-03-01 10:42:27
 111111111111111111111111111111111111 | 2015-03-01 10:42:40 | 1425206560.0000 | 5C1DFE08-8081-4C84-9D8F-E29EBEB1C9D5 | 2015-03-01 10:42:40
 111111111111111111111111111111111111 | 2015-03-01 10:42:40 | 1425206560.0000 | 82C50F8E-9790-4C27-8979-9484954934B4 | 2015-03-01 10:42:40
 111111111111111111111111111111111111 | 2015-03-01 10:42:42 | 1425206562.0000 | 2D30722E-FB37-4563-B2CD-FA545D95AAB4 | 2015-03-01 10:42:40
 111111111111111111111111111111111111 | 2015-03-01 10:42:49 | 1425206569.0000 | 7613F856-763E-4792-904E-8D89E2502710 | 2015-03-01 10:42:42
 111111111111111111111111111111111111 | 2015-03-01 10:43:01 | 1425206581.0000 | DB39294A-E133-4A05-B367-210944965FA3 | 2015-03-01 10:42:49
 111111111111111111111111111111111111 | 2015-03-01 10:43:02 | 1425206582.0000 | 61B8AE48-D5C8-4809-9C4F-56EEA45E3626 | 2015-03-01 10:43:01
 111111111111111111111111111111111111 | 2015-03-01 10:43:02 | 1425206582.0000 | 82870AB0-08F1-403F-B805-836CC1454D1A | 2015-03-01 10:43:02
 111111111111111111111111111111111111 | 2015-03-01 10:43:04 | 1425206584.0000 | BA24E540-0F5B-4BC4-B59B-29D729FA5C35 | 2015-03-01 10:43:02
 999999999999999999999999999999999999 | 2015-05-13 16:40:19 | 1431535219.0000 | 25E20777-508D-4194-9324-BE8A44CE7B59 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:19 | 1431535219.0000 | 72DCEE64-CB3A-43CD-A949-FBE35C0A8873 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:21 | 1431535221.0000 | A0062926-C8A0-47ED-A7F0-65716893DDC0 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:25 | 1431535225.0000 | 0BCEABD0-DCEB-431C-A6FE-6E1243FC9890 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:34 | 1431535234.0000 | 8F72E8D1-A167-460B-9034-C4D1423CAF15 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:19 | 1431535279.0000 | 44C0214C-E13B-4CBB-92BB-4809EAD5DF15 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:20 | 1431535280.0000 | C8CBB801-E8F3-47FD-B9C9-50BD0DDFA356 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:22 | 1431535282.0000 | C3173090-8BB8-407F-BFF5-07B58BCFCA7B | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:48 | 1431535308.0000 | 91290A91-2394-4C5F-8687-6A74593F9FFB | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:48 | 1431535308.0000 | EF138C8C-93BC-4C29-8438-EB0214B3CFC4 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:52 | 1431535312.0000 | 76AA328F-D46D-4235-B780-2B05FCE02855 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:53 | 1431535313.0000 | 563E04CE-7DDB-4543-8F1D-C81625E839F8 | 1970-01-03 09:15:58.430244

It Looks like this happens when the first two events of a device_id have the same timestamp. Hopefully that gives someone a clue?


OP

I am trying to use the lag() function in Spark SQL to determine the length of time between two subsequent events in a table. The important columns are device_id, a text column, unix_time, a numeric timestamp, and event_id, which is unique for each row.

The query I am running:

SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
          OVER
        (PARTITION BY device_id ORDER BY unix_time,event_id)
         AS seconds_since_last_event
FROM ios_d_events;

In Postgres this gives the expected result - however when run in Pyspark, any time there are two events with the same timestamp, seconds_since_last_event is calculated as a large number, ie -1435151676846888 or -1431583545415023, or 25534 - I can't figure out where these numbers are coming from.

I tried adding an if() statement to the query, as

if(unix_time = lag(unix_time)
     OVER (PARTITION BY device_id ORDER BY unix_time,event_id),
   0,
   unix_time - lag(unix_time)
     OVER (PARTITION BY device_id ORDER BY unix_time,event_id))
 AS seconds_since_last_event

But I am getting the same results. Any ideas what might be causing this?

Upvotes: 0

Views: 1548

Answers (1)

rp372
rp372

Reputation: 345

It turns out this is a bug in Spark 1.5.0 - The issue does not present itself in 1.5.1.

Upvotes: 1

Related Questions