SomeOne
SomeOne

Reputation: 33

Counting islands over multiple partitions

I am trying to count the number of trips (where a car is ON and moved a bit) of multiple cars over some fixed period. There is a signal, which when jumps from a value to another two possible values, it signifies that the car is ON, and when drops from those two possible values, signifies that the car is OFF, combined with the speed, this will tell me if it is a trip or not.

This is a gaps-and-islands problem, I need to identify the islands and do the rest. I was able to do that with one car. BUT when I include multiple cars, the query starts to behave weird, and give me different numbers of trips each time I add a car to it. I don't really know why is that happening, and the underlying cause of it.

Here is the query:

WITH cte_data AS
(   SELECT car_id, time_stamp, CAST(json_extract_scalar(json_load,'$.Trip_siganl_value') AS INTEGER) AS current_value,
         LAG(CAST(json_extract_scalar("json_load",'$.Trip_siganl_value') AS INTEGER)) OVER (PARTITION BY car_id ORDER BY time_stamp) as Previousvalue,
         LEAD(CAST(json_extract_scalar("json_load",'$.Trip_siganl_value') AS INTEGER)) OVER (PARTITION BY car_id ORDER BY time_stamp) as Nextvalue, 
         CAST(json_extract_scalar("json_load",'$.speed') AS DOUBLE)  AS speed
    FROM data_base
    WHERE "car_id"  in ('car_id1', 'car_id2', 'car_id3')
    ORDER BY time_stamp 
), 

cte_start AS
(   SELECT car_id, "time_stamp" as starting_time_location, current_value ,
            ROW_NUMBER() OVER (PARTITION BY car_id ORDER BY time_stamp) AS island_number
    FROM cte_data
    WHERE (Previousvalue < 5 OR Previousvalue IS NULL)
    AND current_value in (5,6)

),

cte_end AS
(   SELECT car_id, "time_stamp" as ending_time_location, current_value ,
            ROW_NUMBER() OVER ( PARTITION BY car_id ORDER BY time_stamp) AS island_number
    FROM cte_data
    WHERE (Nextvalue < 5 OR Nextvalue IS NULL)
    AND current_value  in (5,6)


), 
cte_final AS 
( 
SELECT cte_start.car_id AS car_id, cte_start.starting_time_location AS start_date_final, cte_end.ending_time_location as end_date_final, 
( 
    SELECT COUNT(*)
    FROM cte_data 
    WHERE cte_data.time_stamp >= cte_start.starting_time_location 
    AND cte_data.time_stamp <= cte_end.ending_time_location


) AS island_row_count, 
( 
    SELECT SUM(speed)
    FROM cte_data 
    WHERE cte_data.time_stamp >= cte_start.starting_time_location 
    AND cte_data.time_stamp <= cte_end.ending_time_location


) AS island_speed_sum 
FROM cte_start 
INNER JOIN cte_end 
ON cte_start.island_number = cte_end.island_number

)

SELECT car_id, COUNT(*)
FROM cte_final 
WHERE island_row_count > 1
AND island_speed_sum > 0
GROUP BY car_id

can someone tell me why it does work well with a single car, but fails with multiple?

I know, I can put the query inside a Python for-loop, and feed the list of cars individually to the sql query, but that seems like a dirty solution, and I want to understand why SQL is behaving this way with multiple cars.

I am using AWS Athena, So I guess it is Presto or Trino

Here is a dummy input sample data for two cars:

car_id  time_stamp  Trip_signal_value   speed
----------------------------------------------
car_id1     1           3                   0
car_id1     2           4                   0
car_id1     3           5                   0
car_id1     4           5                   3
car_id1     5           5                   5
car_id1     6           5                   8
car_id1     7           6                   10
car_id1     8           6                   14
car_id1     9           5                   10
car_id1     10          5                   5
car_id1     11          6                   3
car_id1     12          6                   0
car_id1     13          3                   0
car_id1     14          3                   0
car_id1     15          3                   0
car_id1     16          3                   0
car_id1     17          0                   0
car_id1     18          0                   0
car_id1     19          0                   0
car_id1     20          2                   0
car_id1     21          4                   0
car_id1     22          5                   0
car_id1     23          5                   3
car_id1     24          5                   6
car_id1     25          5                   5
car_id1     26          5                   9
car_id1     27          5                   10
car_id1     28          5                   5
car_id1     29          5                   4
car_id1     30          5                   0
car_id1     31          2                   0
car_id1     32          1                   0
car_id1     33          0                   0
car_id2     7           2                   0
car_id2     8           2                   0
car_id2     9           5                   0
car_id2     10          5                   2
car_id2     11          6                   4
car_id2     12          6                   9
car_id2     13          5                   12
car_id2     14          5                   13
car_id2     15          5                   10
car_id2     16          5                   7
car_id2     17          6                   4
car_id2     18          5                   0
car_id2     19          3                   0
car_id2     20          2                   0
car_id2     21          2                   0
car_id2     22          2                   0

Note that Trip_singal_value should be either 5 or 6 so I know that the car is ON, any less values means the car is OFF, I guess that is reflected in the SQL query.

Here id what i want to typically get:

car_id      Number_Of_trips 
---------------------------
car_id1             2
car_id2             1

Thanks In advance

Upvotes: 0

Views: 117

Answers (1)

SomeOne
SomeOne

Reputation: 33

Forgot to to put an answer earlier. Anyhow, the problem seemed to be the JOIN that is mixing things up when used on multiple groups. The solution for me was a complete re-write of the SQL query without the use of JOIN, LEAD and LAG, and replacing ROW_NUMBER with DENSE_RANK , here is the re-written query:

WITH cte_data as ( 
    SELECT car_id
        , time_stamp
        , CASE 
            WHEN  CAST(json_extract_scalar("json_load",'$.Trip_signal_value') AS INTEGER) in (5,6) THEN 'ON'
            WHEN  CAST(json_extract_scalar("json_load",'$.Trip_signal_value') AS INTEGER) NOT IN (5,6)  THEN 'OFF'
            WHEN  CAST(json_extract_scalar("json_load",'$.Trip_signal_value') AS INTEGER) IS NULL THEN 'OFF'
            END as car_status, 
    FROM data_base
    ORDER BY time_stamp
), 
ranking as (
    SELECT 
        car_id
        , time_stamp
        , car_status
        , speed
        , DENSE_RANK() OVER (PARTITION BY car_id ORDER BY time_stamp) 
        - DENSE_RANK() OVER (PARTITION BY car_id, car_status ORDER BY time_stamp)
            AS sequence_grouping 
    FROM cte_data 
    ORDER BY time_stamp
),
cte_final as (
    SELECT 
        car_id
        , sequence_grouping
        , MIN(time_stamp) as start_time
        , MAX(time_stamp) as end_time 
        , MAX(time_stamp) - MIN(time_stamp) as duration
        , SUM(speed) as Mileage
    FROM ranking 
    WHERE car_status = 'ON'
    GROUP BY car_id
            , sequence_grouping 
)
SELECT "car_id", COUNT(*) as Number_Of_Trips
FROM cte_final
WHERE Mileage > 0.0
AND duration > 1
GROUP BY "car_id

Performance was also improved by avoiding the use of JOIN, as execution-time was way less (if I remember well, it was four times less).

Here is where I got the idea of the second query from.

Upvotes: 1

Related Questions