Reputation: 33
I am trying to count the number of trips (where a car is ON and moved a bit) of multiple cars over some fixed period. There is a signal, which when jumps from a value to another two possible values, it signifies that the car is ON, and when drops from those two possible values, signifies that the car is OFF, combined with the speed, this will tell me if it is a trip or not.
This is a gaps-and-islands problem, I need to identify the islands and do the rest. I was able to do that with one car. BUT when I include multiple cars, the query starts to behave weird, and give me different numbers of trips each time I add a car to it. I don't really know why is that happening, and the underlying cause of it.
Here is the query:
WITH cte_data AS
( SELECT car_id, time_stamp, CAST(json_extract_scalar(json_load,'$.Trip_siganl_value') AS INTEGER) AS current_value,
LAG(CAST(json_extract_scalar("json_load",'$.Trip_siganl_value') AS INTEGER)) OVER (PARTITION BY car_id ORDER BY time_stamp) as Previousvalue,
LEAD(CAST(json_extract_scalar("json_load",'$.Trip_siganl_value') AS INTEGER)) OVER (PARTITION BY car_id ORDER BY time_stamp) as Nextvalue,
CAST(json_extract_scalar("json_load",'$.speed') AS DOUBLE) AS speed
FROM data_base
WHERE "car_id" in ('car_id1', 'car_id2', 'car_id3')
ORDER BY time_stamp
),
cte_start AS
( SELECT car_id, "time_stamp" as starting_time_location, current_value ,
ROW_NUMBER() OVER (PARTITION BY car_id ORDER BY time_stamp) AS island_number
FROM cte_data
WHERE (Previousvalue < 5 OR Previousvalue IS NULL)
AND current_value in (5,6)
),
cte_end AS
( SELECT car_id, "time_stamp" as ending_time_location, current_value ,
ROW_NUMBER() OVER ( PARTITION BY car_id ORDER BY time_stamp) AS island_number
FROM cte_data
WHERE (Nextvalue < 5 OR Nextvalue IS NULL)
AND current_value in (5,6)
),
cte_final AS
(
SELECT cte_start.car_id AS car_id, cte_start.starting_time_location AS start_date_final, cte_end.ending_time_location as end_date_final,
(
SELECT COUNT(*)
FROM cte_data
WHERE cte_data.time_stamp >= cte_start.starting_time_location
AND cte_data.time_stamp <= cte_end.ending_time_location
) AS island_row_count,
(
SELECT SUM(speed)
FROM cte_data
WHERE cte_data.time_stamp >= cte_start.starting_time_location
AND cte_data.time_stamp <= cte_end.ending_time_location
) AS island_speed_sum
FROM cte_start
INNER JOIN cte_end
ON cte_start.island_number = cte_end.island_number
)
SELECT car_id, COUNT(*)
FROM cte_final
WHERE island_row_count > 1
AND island_speed_sum > 0
GROUP BY car_id
can someone tell me why it does work well with a single car, but fails with multiple?
I know, I can put the query inside a Python for-loop, and feed the list of cars individually to the sql query, but that seems like a dirty solution, and I want to understand why SQL is behaving this way with multiple cars.
I am using AWS Athena, So I guess it is Presto or Trino
Here is a dummy input sample data for two cars:
car_id time_stamp Trip_signal_value speed
----------------------------------------------
car_id1 1 3 0
car_id1 2 4 0
car_id1 3 5 0
car_id1 4 5 3
car_id1 5 5 5
car_id1 6 5 8
car_id1 7 6 10
car_id1 8 6 14
car_id1 9 5 10
car_id1 10 5 5
car_id1 11 6 3
car_id1 12 6 0
car_id1 13 3 0
car_id1 14 3 0
car_id1 15 3 0
car_id1 16 3 0
car_id1 17 0 0
car_id1 18 0 0
car_id1 19 0 0
car_id1 20 2 0
car_id1 21 4 0
car_id1 22 5 0
car_id1 23 5 3
car_id1 24 5 6
car_id1 25 5 5
car_id1 26 5 9
car_id1 27 5 10
car_id1 28 5 5
car_id1 29 5 4
car_id1 30 5 0
car_id1 31 2 0
car_id1 32 1 0
car_id1 33 0 0
car_id2 7 2 0
car_id2 8 2 0
car_id2 9 5 0
car_id2 10 5 2
car_id2 11 6 4
car_id2 12 6 9
car_id2 13 5 12
car_id2 14 5 13
car_id2 15 5 10
car_id2 16 5 7
car_id2 17 6 4
car_id2 18 5 0
car_id2 19 3 0
car_id2 20 2 0
car_id2 21 2 0
car_id2 22 2 0
Note that Trip_singal_value
should be either 5
or 6
so I know that the car is ON, any less values means the car is OFF, I guess that is reflected in the SQL query.
Here id what i want to typically get:
car_id Number_Of_trips
---------------------------
car_id1 2
car_id2 1
Thanks In advance
Upvotes: 0
Views: 117
Reputation: 33
Forgot to to put an answer earlier. Anyhow, the problem seemed to be the JOIN
that is mixing things up when used on multiple groups. The solution for me was a complete re-write of the SQL query without the use of JOIN
, LEAD
and LAG
, and replacing ROW_NUMBER
with DENSE_RANK
, here is the re-written query:
WITH cte_data as (
SELECT car_id
, time_stamp
, CASE
WHEN CAST(json_extract_scalar("json_load",'$.Trip_signal_value') AS INTEGER) in (5,6) THEN 'ON'
WHEN CAST(json_extract_scalar("json_load",'$.Trip_signal_value') AS INTEGER) NOT IN (5,6) THEN 'OFF'
WHEN CAST(json_extract_scalar("json_load",'$.Trip_signal_value') AS INTEGER) IS NULL THEN 'OFF'
END as car_status,
FROM data_base
ORDER BY time_stamp
),
ranking as (
SELECT
car_id
, time_stamp
, car_status
, speed
, DENSE_RANK() OVER (PARTITION BY car_id ORDER BY time_stamp)
- DENSE_RANK() OVER (PARTITION BY car_id, car_status ORDER BY time_stamp)
AS sequence_grouping
FROM cte_data
ORDER BY time_stamp
),
cte_final as (
SELECT
car_id
, sequence_grouping
, MIN(time_stamp) as start_time
, MAX(time_stamp) as end_time
, MAX(time_stamp) - MIN(time_stamp) as duration
, SUM(speed) as Mileage
FROM ranking
WHERE car_status = 'ON'
GROUP BY car_id
, sequence_grouping
)
SELECT "car_id", COUNT(*) as Number_Of_Trips
FROM cte_final
WHERE Mileage > 0.0
AND duration > 1
GROUP BY "car_id
Performance was also improved by avoiding the use of JOIN
, as execution-time was way less (if I remember well, it was four times less).
Here is where I got the idea of the second query from.
Upvotes: 1