Reputation: 302
I have the following table:
id | number_of_trip | start_date          | end_date            | seconds
---+----------------+---------------------+---------------------+---------
 1 | 637hui         | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 |     720
 2 | 384nfj         | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 |    1440
 3 | 102fiu         | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 |     780
 4 | 948pvc         | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 |    2400
 5 | 473mds         | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 |     780
 6 | 103fkd         | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 |    1380
 7 | 905783         | 2022-03-10 03:12:00 | null                |       0
 8 | 498wsq         | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 |    1080
I want to get the driven time for each hour, but if a trip spans two hours, its time must be split between those hours.
If a trip has not yet finished, the end_date field is null, but the time elapsed since start_date must still be counted in the respective hours.
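To make the requirement concrete, trip 948pvc runs from 02:40 to 03:20, so its 2400 seconds should be split as 1200 in the 02:00 hour and 1200 in the 03:00 hour:

```python
import pandas as pd

# illustrative split for trip 948pvc (02:40 -> 03:20)
start = pd.Timestamp("2022-03-10 02:40:00")
end = pd.Timestamp("2022-03-10 03:20:00")
boundary = pd.Timestamp("2022-03-10 03:00:00")  # hour boundary crossed by the trip

seconds_in_hour_02 = (boundary - start).total_seconds()  # 1200.0
seconds_in_hour_03 = (end - boundary).total_seconds()    # 1200.0
```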
I have the following query:
SELECT time_bucket(bucket_width := INTERVAL '1 hour',ts := start_date, "offset" := '0 minutes') AS init_date,
sum(seconds) as seconds
FROM trips
WHERE start_date >= '2022-03-10 01:00:00' AND start_date <= '2022-03-10 06:00:00'
GROUP BY init_date
ORDER BY init_date;
The result is:
init_date           | seconds
--------------------+---------
2022-03-10 01:00:00 |     720
2022-03-10 02:00:00 |    5400
2022-03-10 03:00:00 |    1380
2022-03-10 05:00:00 |    1080
However I expect to receive a result like this:
init_date           | seconds | (ids shown only as a visual aid)
--------------------+---------+--------------------------------------
2022-03-10 01:00:00 |     720 | id(1: 720)
2022-03-10 02:00:00 |    4200 | id(2: 1440, 3: 780, 4: 1200, 5: 780)
2022-03-10 03:00:00 |    5460 | id(4: 1200, 6: 1380, 7: 2880)
2022-03-10 05:00:00 |    1080 | id(8: 1080)
EDIT
If I replace the null value, the result is still not what I want:
init_date           | seconds
--------------------+---------
2022-03-10 01:00:00 |     720
2022-03-10 02:00:00 |    5400
2022-03-10 03:00:00 |    1380
2022-03-10 05:00:00 |    1080
I have been thinking about getting all the data and solving the problem with pandas. I'll try it and post if I get the answer.
EDIT
My previous expected result was not entirely correct, since there were hours left over from a trip that has not yet finished. The correct result should be:
start_date seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 5460
3 2022-03-10 04:00:00 3600
4 2022-03-10 05:00:00 4680
NEW CODE
def bucket_count(bucket, data):
    result = pd.DataFrame()
    list_r = []
    for row_bucket in bucket.to_dict('records'):
        inicio = row_bucket['start_date']
        fin = row_bucket['end_date']
        # keep only the trips that overlap the bucket [inicio, fin]
        df = data[
            (inicio <= data['end_date']) & (inicio <= fin)
            & (data['start_date'] <= fin) & (data['start_date'] <= data['end_date'])
        ]
        df_dict = df.to_dict('records')
        for row in df_dict:
            seconds = 0
            if row['start_date'] >= inicio and fin >= row['end_date']:
                # trip fully inside the bucket
                seconds = (row['end_date'] - row['start_date']).total_seconds()
            elif row['start_date'] <= inicio <= row['end_date'] <= fin:
                # trip started before the bucket and ends inside it
                seconds = (row['end_date'] - inicio).total_seconds()
            elif inicio <= row['start_date'] <= fin <= row['end_date']:
                # trip starts inside the bucket and ends after it
                seconds = (fin - row['start_date']).total_seconds()
            elif row['start_date'] < inicio and fin < row['end_date']:
                # trip spans the whole bucket
                seconds = (fin - inicio).total_seconds()
            row['start_date'] = inicio
            row['end_date'] = fin
            row['seconds'] = seconds
            list_r.append(row)
    result = pd.DataFrame(list_r)
    return result.groupby(['start_date'])["seconds"].apply(lambda x: x.astype(int).sum()).reset_index()
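For reference, an equivalent fully vectorized computation produces the same totals. This is a sketch: the hourly windows and the timestamp used to close the still-running trip (2022-03-10 06:00) are assumptions.

```python
import pandas as pd

trips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30",
    ]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48",
    ]),
})
# close the still-running trip at an assumed "current time"
trips["end_date"] = trips["end_date"].fillna(pd.Timestamp("2022-03-10 06:00"))

rows = []
for h in pd.date_range("2022-03-10 01:00", "2022-03-10 05:00", freq="h"):
    # clip every trip to the hourly window [h, h + 1h)
    lo = trips["start_date"].clip(lower=h)
    hi = trips["end_date"].clip(upper=h + pd.Timedelta(hours=1))
    # negative durations mean the trip does not overlap this window
    secs = (hi - lo).dt.total_seconds().clip(lower=0).sum()
    rows.append({"start_date": h, "seconds": int(secs)})

result = pd.DataFrame(rows)
# seconds per hour: 720, 4200, 5460, 3600, 4680
```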
Upvotes: 6
Views: 1000
Reputation: 2271
This answer will use staircase, which is built upon pandas and numpy, and operates as part of the pandas ecosystem.
Your data describes intervals, which can be thought of as step functions which have a value of 1 during the interval and 0 otherwise. Using staircase
we will add the step functions for each trip together, slice the step function into hour buckets, and then integrate to get the total time for each bucket.
setup
A dataframe with pandas.Timestamp values. The trip number is not relevant in this solution.
import pandas as pd

df = pd.DataFrame({
"start_date": [
pd.Timestamp("2022-03-10 1:20"),
pd.Timestamp("2022-03-10 2:18"),
pd.Timestamp("2022-03-10 2:10"),
pd.Timestamp("2022-03-10 2:40"),
pd.Timestamp("2022-03-10 2:45"),
pd.Timestamp("2022-03-10 3:05"),
pd.Timestamp("2022-03-10 3:12"),
pd.Timestamp("2022-03-10 5:30"),
],
"end_date": [
pd.Timestamp("2022-03-10 1:32"),
pd.Timestamp("2022-03-10 2:42"),
pd.Timestamp("2022-03-10 2:23"),
pd.Timestamp("2022-03-10 3:20"),
pd.Timestamp("2022-03-10 2:58"),
pd.Timestamp("2022-03-10 3:28"),
pd.NaT,
pd.Timestamp("2022-03-10 5:48"),
],
})
solution
import staircase as sc
# create step function
# the Stairs class represents a step function. It is to staircase as DataFrame is to pandas.
sf = sc.Stairs(df, start="start_date", end="end_date")
# you could visually inspect it if you want
sf.plot(style="hlines")
From inspection you can see the maximum number of concurrent trips is 3. Also note that the step function continues on to infinity with a value of 1 - this is because we do not know the end date for one of the records.
# define hourly buckets as pandas PeriodIndex
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
# integrate the step function over the hourly buckets
total_per_hour = sf.slice(hour_buckets).integral()
total_per_hour is a pandas.Series of pandas.Timedelta values, indexed by a pandas.IntervalIndex. It looks like this:
[2022-03-10 01:00:00, 2022-03-10 02:00:00) 0 days 00:12:00
[2022-03-10 02:00:00, 2022-03-10 03:00:00) 0 days 01:10:00
[2022-03-10 03:00:00, 2022-03-10 04:00:00) 0 days 01:31:00
[2022-03-10 04:00:00, 2022-03-10 05:00:00) 0 days 01:00:00
[2022-03-10 05:00:00, 2022-03-10 06:00:00) 0 days 01:18:00
dtype: timedelta64[ns]
If you want a dataframe format where only the left side of the interval is referenced, and time is given as seconds, then use the following
pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
to summarise
The solution is
import staircase as sc
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
total_per_hour = sc.Stairs(df, start="start_date", end="end_date").slice(hour_buckets).integral()
# optional
total_per_hour = pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
note 1
In your expected answer you do not have a value for 2022-03-10 04:00:00. This seems inconsistent with the fact that time for trip 905783 (which has no end date) is being included for 2022-03-10 03:00:00 but not for subsequent hours. The solution proposed here includes 3600s for 2022-03-10 04:00:00 and for 2022-03-10 05:00:00, which is why it differs from the expected solution in the original question.
note 2
If your dataframe has a "driver" column and you want to tally time per driver then the following will work
def make_total_by_hour(df_):
return sc.Stairs(df_, "start_date", "end_date").slice(hour_buckets).integral()
total_per_hour = (
df.groupby("driver")
.apply(make_total_by_hour)
.melt(ignore_index=False)
.reset_index()
)
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Upvotes: 0
Reputation: 538
This can be done in plain SQL (apart from the time_bucket function), using a nested query:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
) subq
group by interval_start
order by interval_start;
This gives me the following result:
interval_start | seconds
---------------------+---------
2022-03-10 01:00:00 | 720
2022-03-10 02:00:00 | 4200
2022-03-10 03:00:00 | 5460
2022-03-10 04:00:00 | 3600
2022-03-10 05:00:00 | 4680
2022-03-10 06:00:00 | 0
(6 rows)
Let's break the query down.
In the innermost query:
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
we generate a series of time interval starts - from the minimal start_date value up to the maximal end_date value, truncated to full hours, with a 1-hour step. Either boundary can obviously be replaced with an arbitrary datetime. The direct result of this query is the following:
interval_start
---------------------
2022-03-10 01:00:00
2022-03-10 02:00:00
2022-03-10 03:00:00
2022-03-10 04:00:00
2022-03-10 05:00:00
2022-03-10 06:00:00
(6 rows)
Then, the middle-level query joins this series with the trips
table, joining rows if and only if any part of the trip took place during the hour-long interval beginning at the time given by the 'interval_start' column:
select interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
-- innermost query
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
-- innermost query end
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour' and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
The two computed values represent, respectively:
- seconds_before_trip_started - the number of seconds between the beginning of the interval and the beginning of the trip (or 0 if the trip began prior to the interval start). This is time during which the trip did not take place, so we will be subtracting it in the following step.
- seconds_before_trip_ended - the number of seconds between the beginning of the interval and the end of the trip (capped at 3600 if the trip did not end within the concerned interval).
The outermost query subtracts the two aforementioned fields, effectively computing the time each trip took in each interval, and sums it over all trips, grouping by interval:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
-- middle-level query
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
-- middle-level query end
) subq
group by interval_start
order by interval_start;
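As a sanity check of the greatest/least arithmetic, here is the same computation for trip 948pvc (02:40-03:20) in the 02:00 interval, sketched in Python (illustrative values only, not the query itself):

```python
from datetime import datetime

interval_start = datetime(2022, 3, 10, 2, 0)
start_date = datetime(2022, 3, 10, 2, 40)
end_date = datetime(2022, 3, 10, 3, 20)

# greatest(0, extract(epoch from start_date - interval_start))
seconds_before_trip_started = max(0, int((start_date - interval_start).total_seconds()))  # 2400
# least(3600, extract(epoch from end_date - interval_start))
seconds_before_trip_ended = min(3600, int((end_date - interval_start).total_seconds()))   # capped at 3600

# 3600 - 2400 = 1200 seconds of this trip fall in the 02:00 interval
seconds = seconds_before_trip_ended - seconds_before_trip_started
```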
In case we have another column in the table, and what we really need is the segmentation of the above result with respect to that column, we simply need to add it to the appropriate select and group by clauses (and optionally to the order by clause as well).
Suppose there's an additional driver_id column in the trips table:
id | number_of_trip | start_date | end_date | seconds | driver_id
----+----------------+---------------------+---------------------+---------+-----------
1 | 637hui | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 | 720 | 0
2 | 384nfj | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 | 1440 | 0
3 | 102fiu | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 | 780 | 1
4 | 948pvc | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 | 2400 | 1
5 | 473mds | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 | 780 | 1
6 | 103fkd | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 | 1380 | 2
7 | 905783 | 2022-03-10 03:12:00 | | 0 | 2
8 | 498wsq | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 | 1080 | 2
The modified query would look like this:
select
interval_start,
driver_id,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
driver_id,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
) subq
group by interval_start, driver_id
order by interval_start, driver_id;
and gives the following result:
interval_start | driver_id | seconds
---------------------+-----------+---------
2022-03-10 01:00:00 | 0 | 720
2022-03-10 02:00:00 | 0 | 1440
2022-03-10 02:00:00 | 1 | 2760
2022-03-10 03:00:00 | 1 | 1200
2022-03-10 03:00:00 | 2 | 4260
2022-03-10 04:00:00 | 2 | 3600
2022-03-10 05:00:00 | 2 | 4680
2022-03-10 06:00:00 | 2 | 0
Upvotes: 1
Reputation: 41417
I have been thinking about getting all the data and solving the problem with pandas.
TLDR: Generate a range of minutes per trip, explode those minutes into rows, and resample those rows into hours to count the minutes per hour:
import pandas as pd
df = pd.read_sql(...)
# convert to datetime dtype if not already
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# fill missing end dates
current_time = pd.Timestamp('2022-03-10 04:00:00') # or pd.Timestamp.now()
df['end_date'] = df['end_date'].fillna(current_time)
# generate range of minutes per trip
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
(df[['id', 'init_date']].explode('init_date') # explode minutes into rows
.set_index('init_date')['id'].resample('H').count() # count rows (minutes) per hour
.mul(60).reset_index(name='seconds')) # convert minutes to seconds
Output:
init_date seconds
2022-03-10 01:00:00 720
2022-03-10 02:00:00 4200
2022-03-10 03:00:00 5460
2022-03-10 04:00:00 0
2022-03-10 05:00:00 1080
Generate a date_range of minutes from start_date to end_date per trip:
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
# id number_of_trip ... init_date
# 1 637hui ... DatetimeIndex(['2022-03-10 01:20:00', '2022-03-10 01:21:00', ..., '2022-03-10 01:31:00'])
# 2 384nfj ... DatetimeIndex(['2022-03-10 02:18:00', '2022-03-10 02:19:00', ..., '2022-03-10 02:41:00'])
# 3 102fiu ... DatetimeIndex(['2022-03-10 02:10:00', '2022-03-10 02:11:00', ..., '2022-03-10 02:22:00'])
# 4 948pvc ... DatetimeIndex(['2022-03-10 02:40:00', '2022-03-10 02:41:00', ..., '2022-03-10 03:19:00'])
# 5 473mds ... DatetimeIndex(['2022-03-10 02:45:00', '2022-03-10 02:46:00', ..., '2022-03-10 02:57:00'])
# 6 103fkd ... DatetimeIndex(['2022-03-10 03:05:00', '2022-03-10 03:06:00', ..., '2022-03-10 03:27:00'])
# 7 905783 ... DatetimeIndex(['2022-03-10 03:12:00', '2022-03-10 03:13:00', ..., '2022-03-10 03:59:00'])
# 8 498wsq ... DatetimeIndex(['2022-03-10 05:30:00', '2022-03-10 05:31:00', ..., '2022-03-10 05:47:00'])
explode the minutes into rows:
exploded = df[['init_date', 'id']].explode('init_date').set_index('init_date')['id']
# init_date
# 2022-03-10 01:20:00 1
# 2022-03-10 01:21:00 1
# 2022-03-10 01:22:00 1
# ..
# 2022-03-10 05:45:00 8
# 2022-03-10 05:46:00 8
# 2022-03-10 05:47:00 8
# Name: id, Length: 191, dtype: int64
resample the rows into hours to count the minutes per hour (× 60 to convert to seconds):
out = exploded.resample('H').count().mul(60).reset_index(name='seconds')
# init_date seconds
# 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
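Note that the minute-level date_range works here because all timestamps fall on whole minutes. If timestamps can carry seconds, the same idea works at second granularity; a sketch with two hypothetical trips (end dates already filled):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 8],
    "start_date": pd.to_datetime(["2022-03-10 01:20:00", "2022-03-10 05:30:30"]),
    "end_date": pd.to_datetime(["2022-03-10 01:32:00", "2022-03-10 05:48:30"]),
})

# one row per elapsed second instead of per minute
df["init_date"] = df.apply(
    lambda x: pd.date_range(x["start_date"], x["end_date"], freq="s", inclusive="left"),
    axis=1,
)
out = (df[["id", "init_date"]].explode("init_date")      # explode seconds into rows
       .set_index("init_date")["id"].resample("h").count()  # each row is one second
       .reset_index(name="seconds"))
# seconds per hour: 720, 0, 0, 0, 1080
```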
If I have a column with the driver id, how do I get a segmentation by hours and by driver id without reprocessing?
In this case, just change resample to groupby.resample: select driver_id before exploding, and group by driver_id before resampling.
As a minimal example, I duplicated the sample data to create two driver_id groups, a and b:
# after preprocessing and creating init_date ...
(df[['driver_id', 'init_date']] # now include driver_id
.explode('init_date').set_index('init_date') # explode minutes into rows
.groupby('driver_id').resample('H').count() # count rows (minutes) per hour per driver_id
.mul(60).rename(columns={'driver_id': 'seconds'})) # convert minutes to seconds
# seconds
# driver_id init_date
# a 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
# b 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
Upvotes: 3
Reputation: 1982
Here is what works in sqlite (can be tested):
CREATE TABLE trips(
id INT PRIMARY KEY NOT NULL,
start_date TIMESTAMP,
end_date TIMESTAMP,
seconds INT
);
INSERT INTO trips(id, start_date, end_date, seconds) VALUES
(1, '2022-03-10 01:20:00', '2022-03-10 01:32:00', 720),
(2, '2022-03-10 02:18:00', '2022-03-10 02:42:00', 1440),
(3, '2022-03-10 02:10:00', '2022-03-10 02:23:00', 780),
(4, '2022-03-10 02:40:00', '2022-03-10 03:20:00', 2400),
(5, '2022-03-10 02:45:00', '2022-03-10 02:58:00', 780),
(6, '2022-03-10 03:05:00', '2022-03-10 03:28:00', 1380),
(7, '2022-03-10 03:12:00', NULL, 0),
(8, '2022-03-10 05:30:00', '2022-03-10 05:48:00', 1080);
WITH
checked AS (SELECT '2022-03-10 03:00:00' AS start, '2022-03-10 04:00:00' AS end)
SELECT
SUM(
IIF(end_date IS NULL, ROUND(MAX(0, (JULIANDAY(checked.end) - JULIANDAY(start_date)) * 24 * 60 * 60)),
MAX(
0,
(JULIANDAY(MIN(checked.end, end_date)) - JULIANDAY(MAX(checked.start, start_date))) /
(JULIANDAY(end_date) - JULIANDAY(start_date)) * seconds
)
)
)
FROM trips, checked;
DROP TABLE trips;
The code is simplified and sqlite lacks some features, but I think it will be easy to adapt :)
Briefly, the algorithm is: for each finished trip, scale its seconds by the fraction of the trip that overlaps the checked interval; for an unfinished trip (null end_date), count the time from start_date up to the end of the interval. This can be done for any interval with a start and an end.
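The proportional step can be sketched in Python for trip 948pvc and the checked window 03:00-04:00 (illustrative values only):

```python
from datetime import datetime

checked_start = datetime(2022, 3, 10, 3, 0)
checked_end = datetime(2022, 3, 10, 4, 0)
trip_start = datetime(2022, 3, 10, 2, 40)
trip_end = datetime(2022, 3, 10, 3, 20)
trip_seconds = 2400

# overlap of the trip with the checked window, clamped at 0
overlap = (min(checked_end, trip_end) - max(checked_start, trip_start)).total_seconds()
overlap = max(0.0, overlap)

# scale the trip's seconds by the overlapping fraction: 1200/2400 of 2400 s
seconds_in_window = overlap / (trip_end - trip_start).total_seconds() * trip_seconds
```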
Upvotes: 0