Reputation: 302
I have the following table:
id | number_of_trip | start_date          | end_date            | seconds
---+----------------+---------------------+---------------------+---------
 1 | 637hui         | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 |     720
 2 | 384nfj         | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 |    1440
 3 | 102fiu         | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 |     780
 4 | 948pvc         | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 |    2400
 5 | 473mds         | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 |     780
 6 | 103fkd         | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 |    1380
 7 | 905783         | 2022-03-10 03:12:00 | null                |       0
 8 | 498wsq         | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 |    1080
I want to get the driven time for each hour, but if a trip spans two hours, its time must be split between those hours.
If a trip has not yet finished, the end_date field is null, but the time elapsed since start_date must still be counted in the respective hours.
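To make the requirement concrete, trip 948pvc runs from 02:40 to 03:20, so its 2400 seconds should be split as 1200 in the 02:00 hour and 1200 in the 03:00 hour:

```python
import pandas as pd

# illustrative split for trip 948pvc (02:40 -> 03:20)
start = pd.Timestamp("2022-03-10 02:40:00")
end = pd.Timestamp("2022-03-10 03:20:00")
boundary = pd.Timestamp("2022-03-10 03:00:00")  # hour boundary crossed by the trip

seconds_in_hour_02 = (boundary - start).total_seconds()  # 1200.0
seconds_in_hour_03 = (end - boundary).total_seconds()    # 1200.0
```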
I have the following query:
SELECT time_bucket(bucket_width := INTERVAL '1 hour',ts := start_date, "offset" := '0 minutes') AS init_date,
sum(seconds) as seconds
FROM trips
WHERE start_date >= '2022-03-10 01:00:00' AND start_date <= '2022-03-10 06:00:00'
GROUP BY init_date
ORDER BY init_date;
The result is:
init_date           | seconds
--------------------+---------
2022-03-10 01:00:00 |     720
2022-03-10 02:00:00 |    5400
2022-03-10 03:00:00 |    1380
2022-03-10 05:00:00 |    1080
However I expect to receive a result like this:
init_date           | seconds | (ids shown only as a visual aid)
--------------------+---------+--------------------------------------
2022-03-10 01:00:00 |     720 | id(1: 720)
2022-03-10 02:00:00 |    4200 | id(2: 1440, 3: 780, 4: 1200, 5: 780)
2022-03-10 03:00:00 |    5460 | id(4: 1200, 6: 1380, 7: 2880)
2022-03-10 05:00:00 |    1080 | id(8: 1080)
EDIT
If I replace the null value, the result is still not what I want:
init_date           | seconds
--------------------+---------
2022-03-10 01:00:00 |     720
2022-03-10 02:00:00 |    5400
2022-03-10 03:00:00 |    1380
2022-03-10 05:00:00 |    1080
I have been thinking about getting all the data and solving the problem with pandas. I'll try it and post if I get the answer.
EDIT
My previous expected result was not entirely correct, since there were hours left over from a trip that has not yet finished. The correct result should be:
start_date seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 5460
3 2022-03-10 04:00:00 3600
4 2022-03-10 05:00:00 4680
NEW CODE
def bucket_count(bucket, data):
    result = pd.DataFrame()
    list_r = []
    for row_bucket in bucket.to_dict('records'):
        inicio = row_bucket['start_date']
        fin = row_bucket['end_date']
        # keep only the trips that overlap the bucket [inicio, fin]
        df = data[
            (inicio <= data['end_date']) & (inicio <= fin)
            & (data['start_date'] <= fin) & (data['start_date'] <= data['end_date'])
        ]
        df_dict = df.to_dict('records')
        for row in df_dict:
            seconds = 0
            if row['start_date'] >= inicio and fin >= row['end_date']:
                # trip fully inside the bucket
                seconds = (row['end_date'] - row['start_date']).total_seconds()
            elif row['start_date'] <= inicio <= row['end_date'] <= fin:
                # trip started before the bucket and ends inside it
                seconds = (row['end_date'] - inicio).total_seconds()
            elif inicio <= row['start_date'] <= fin <= row['end_date']:
                # trip starts inside the bucket and ends after it
                seconds = (fin - row['start_date']).total_seconds()
            elif row['start_date'] < inicio and fin < row['end_date']:
                # trip spans the whole bucket
                seconds = (fin - inicio).total_seconds()
            row['start_date'] = inicio
            row['end_date'] = fin
            row['seconds'] = seconds
            list_r.append(row)
    result = pd.DataFrame(list_r)
    return result.groupby(['start_date'])["seconds"].apply(lambda x: x.astype(int).sum()).reset_index()
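For reference, an equivalent fully vectorized computation produces the same totals. This is a sketch: the hourly windows and the timestamp used to close the still-running trip (2022-03-10 06:00) are assumptions.

```python
import pandas as pd

trips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30",
    ]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48",
    ]),
})
# close the still-running trip at an assumed "current time"
trips["end_date"] = trips["end_date"].fillna(pd.Timestamp("2022-03-10 06:00"))

rows = []
for h in pd.date_range("2022-03-10 01:00", "2022-03-10 05:00", freq="h"):
    # clip every trip to the hourly window [h, h + 1h)
    lo = trips["start_date"].clip(lower=h)
    hi = trips["end_date"].clip(upper=h + pd.Timedelta(hours=1))
    # negative durations mean the trip does not overlap this window
    secs = (hi - lo).dt.total_seconds().clip(lower=0).sum()
    rows.append({"start_date": h, "seconds": int(secs)})

result = pd.DataFrame(rows)
# seconds per hour: 720, 4200, 5460, 3600, 4680
```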
Upvotes: 6
Views: 1000
Reputation: 2271
This answer will use staircase, which is built upon pandas and numpy, and operates as part of the pandas ecosystem.
Your data describes intervals, which can be thought of as step functions which have a value of 1 during the interval and 0 otherwise. Using staircase
we will add the step functions for each trip together, slice the step function into hour buckets, and then integrate to get the total time for each bucket.
setup
A dataframe with pandas.Timestamp values. The trip number is not relevant in this solution.
import pandas as pd

df = pd.DataFrame({
"start_date": [
pd.Timestamp("2022-03-10 1:20"),
pd.Timestamp("2022-03-10 2:18"),
pd.Timestamp("2022-03-10 2:10"),
pd.Timestamp("2022-03-10 2:40"),
pd.Timestamp("2022-03-10 2:45"),
pd.Timestamp("2022-03-10 3:05"),
pd.Timestamp("2022-03-10 3:12"),
pd.Timestamp("2022-03-10 5:30"),
],
"end_date": [
pd.Timestamp("2022-03-10 1:32"),
pd.Timestamp("2022-03-10 2:42"),
pd.Timestamp("2022-03-10 2:23"),
pd.Timestamp("2022-03-10 3:20"),
pd.Timestamp("2022-03-10 2:58"),
pd.Timestamp("2022-03-10 3:28"),
pd.NaT,
pd.Timestamp("2022-03-10 5:48"),
],
})
solution
import staircase as sc
# create step function
# the Stairs class represents a step function. It is to staircase as DataFrame is to pandas.
sf = sc.Stairs(df, start="start_date", end="end_date")
# you could visually inspect it if you want
sf.plot(style="hlines")
From inspection you can see the maximum number of concurrent trips is 3. Also note that the step function continues on to infinity with a value of 1 - this is because we do not know the end date for one of the records.
# define hourly buckets as pandas PeriodIndex
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
# integrate the step function over the hourly buckets
total_per_hour = sf.slice(hour_buckets).integral()
total_per_hour is a pandas.Series of pandas.Timedelta values, indexed by a pandas.IntervalIndex. It looks like this:
[2022-03-10 01:00:00, 2022-03-10 02:00:00) 0 days 00:12:00
[2022-03-10 02:00:00, 2022-03-10 03:00:00) 0 days 01:10:00
[2022-03-10 03:00:00, 2022-03-10 04:00:00) 0 days 01:31:00
[2022-03-10 04:00:00, 2022-03-10 05:00:00) 0 days 01:00:00
[2022-03-10 05:00:00, 2022-03-10 06:00:00) 0 days 01:18:00
dtype: timedelta64[ns]
If you want a dataframe format where only the left side of the interval is referenced, and time is given as seconds, then use the following
pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
to summarise
The solution is
import staircase as sc
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
total_per_hour = sc.Stairs(df, start="start_date", end="end_date").slice(hour_buckets).integral()
# optional
total_per_hour = pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
note 1
In your expected answer you do not have a value for 2022-03-10 04:00:00. This seems inconsistent with the fact that time for trip 905783 (which has no end date) is being included for 2022-03-10 03:00:00 but not for subsequent hours. The solution proposed here includes 3600s for 2022-03-10 04:00:00 and for 2022-03-10 05:00:00, which is why it differs from the expected solution in the original question.
note 2
If your dataframe has a "driver" column and you want to tally time per driver then the following will work
def make_total_by_hour(df_):
return sc.Stairs(df_, "start_date", "end_date").slice(hour_buckets).integral()
total_per_hour = (
df.groupby("driver")
.apply(make_total_by_hour)
.melt(ignore_index=False)
.reset_index()
)
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Upvotes: 0
Reputation: 538
This can be done in plain SQL (apart from the time_bucket function), using a nested query:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
) subq
group by interval_start
order by interval_start;
This gives me the following result:
interval_start | seconds
---------------------+---------
2022-03-10 01:00:00 | 720
2022-03-10 02:00:00 | 4200
2022-03-10 03:00:00 | 5460
2022-03-10 04:00:00 | 3600
2022-03-10 05:00:00 | 4680
2022-03-10 06:00:00 | 0
(6 rows)
Let's break the query down.
In the innermost query:
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
we generate a series of time interval starts - from the minimal start_date value up to the maximal end_date value, truncated to full hours, with a 1-hour step. Either boundary can obviously be replaced with an arbitrary datetime. The direct result of this query is the following:
interval_start
---------------------
2022-03-10 01:00:00
2022-03-10 02:00:00
2022-03-10 03:00:00
2022-03-10 04:00:00
2022-03-10 05:00:00
2022-03-10 06:00:00
(6 rows)
Then, the middle-level query joins this series with the trips
table, joining rows if and only if any part of the trip took place during the hour-long interval beginning at the time given by the 'interval_start' column:
select interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
-- innermost query
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
-- innermost query end
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour' and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
The two computed values represent, respectively:
- seconds_before_trip_started - the number of seconds between the beginning of the interval and the beginning of the trip (or 0 if the trip began prior to the interval start). This is time during which the trip did not take place, so we will be subtracting it in the following step.
- seconds_before_trip_ended - the number of seconds between the beginning of the interval and the end of the trip (capped at 3600 if the trip did not end within the concerned interval).
The outermost query subtracts the two aforementioned fields, effectively computing the time each trip took in each interval, and sums it over all trips, grouping by interval:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
-- middle-level query
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
-- middle-level query end
) subq
group by interval_start
order by interval_start;
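As a sanity check of the greatest/least arithmetic, here is the same computation for trip 948pvc (02:40-03:20) in the 02:00 interval, sketched in Python (illustrative values only, not the query itself):

```python
from datetime import datetime

interval_start = datetime(2022, 3, 10, 2, 0)
start_date = datetime(2022, 3, 10, 2, 40)
end_date = datetime(2022, 3, 10, 3, 20)

# greatest(0, extract(epoch from start_date - interval_start))
seconds_before_trip_started = max(0, int((start_date - interval_start).total_seconds()))  # 2400
# least(3600, extract(epoch from end_date - interval_start))
seconds_before_trip_ended = min(3600, int((end_date - interval_start).total_seconds()))   # capped at 3600

# 3600 - 2400 = 1200 seconds of this trip fall in the 02:00 interval
seconds = seconds_before_trip_ended - seconds_before_trip_started
```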
In case we have another column in the table, and what we really need is the segmentation of the above result with respect to that column, we simply need to add it to the appropriate select and group by clauses (and optionally to the order by clause as well).
Suppose there's an additional driver_id column in the trips table:
id | number_of_trip | start_date | end_date | seconds | driver_id
----+----------------+---------------------+---------------------+---------+-----------
1 | 637hui | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 | 720 | 0
2 | 384nfj | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 | 1440 | 0
3 | 102fiu | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 | 780 | 1
4 | 948pvc | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 | 2400 | 1
5 | 473mds | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 | 780 | 1
6 | 103fkd | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 | 1380 | 2
7 | 905783 | 2022-03-10 03:12:00 | | 0 | 2
8 | 498wsq | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 | 1080 | 2
The modified query would look like this:
select
interval_start,
driver_id,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
driver_id,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
) subq
group by interval_start, driver_id
order by interval_start, driver_id;
and gives the following result:
interval_start | driver_id | seconds
---------------------+-----------+---------
2022-03-10 01:00:00 | 0 | 720
2022-03-10 02:00:00 | 0 | 1440
2022-03-10 02:00:00 | 1 | 2760
2022-03-10 03:00:00 | 1 | 1200
2022-03-10 03:00:00 | 2 | 4260
2022-03-10 04:00:00 | 2 | 3600
2022-03-10 05:00:00 | 2 | 4680
2022-03-10 06:00:00 | 2 | 0
Upvotes: 1
Reputation: 41417
I have been thinking about getting all the data and solving the problem with pandas.
TLDR: Generate a range of minutes per trip, explode those minutes into rows, and resample those rows into hours to count the minutes per hour:
import pandas as pd
df = pd.read_sql(...)
# convert to datetime dtype if not already
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# fill missing end dates
current_time = pd.Timestamp('2022-03-10 04:00:00') # or pd.Timestamp.now()
df['end_date'] = df['end_date'].fillna(current_time)
# generate range of minutes per trip
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
(df[['id', 'init_date']].explode('init_date') # explode minutes into rows
.set_index('init_date')['id'].resample('H').count() # count rows (minutes) per hour
.mul(60).reset_index(name='seconds')) # convert minutes to seconds
Output:
init_date seconds
2022-03-10 01:00:00 720
2022-03-10 02:00:00 4200
2022-03-10 03:00:00 5460
2022-03-10 04:00:00 0
2022-03-10 05:00:00 1080
Generate a date_range of minutes from start_date to end_date per trip:
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
# id number_of_trip ... init_date
# 1 637hui ... DatetimeIndex(['2022-03-10 01:20:00', '2022-03-10 01:21:00', ..., '2022-03-10 01:31:00'])
# 2 384nfj ... DatetimeIndex(['2022-03-10 02:18:00', '2022-03-10 02:19:00', ..., '2022-03-10 02:41:00'])
# 3 102fiu ... DatetimeIndex(['2022-03-10 02:10:00', '2022-03-10 02:11:00', ..., '2022-03-10 02:22:00'])
# 4 948pvc ... DatetimeIndex(['2022-03-10 02:40:00', '2022-03-10 02:41:00', ..., '2022-03-10 03:19:00'])
# 5 473mds ... DatetimeIndex(['2022-03-10 02:45:00', '2022-03-10 02:46:00', ..., '2022-03-10 02:57:00'])
# 6 103fkd ... DatetimeIndex(['2022-03-10 03:05:00', '2022-03-10 03:06:00', ..., '2022-03-10 03:27:00'])
# 7 905783 ... DatetimeIndex(['2022-03-10 03:12:00', '2022-03-10 03:13:00', ..., '2022-03-10 03:59:00'])
# 8 498wsq ... DatetimeIndex(['2022-03-10 05:30:00', '2022-03-10 05:31:00', ..., '2022-03-10 05:47:00'])
explode the minutes into rows:
exploded = df[['init_date', 'id']].explode('init_date').set_index('init_date')['id']
# init_date
# 2022-03-10 01:20:00 1
# 2022-03-10 01:21:00 1
# 2022-03-10 01:22:00 1
# ..
# 2022-03-10 05:45:00 8
# 2022-03-10 05:46:00 8
# 2022-03-10 05:47:00 8
# Name: id, Length: 191, dtype: int64
resample the rows into hours to count the minutes per hour (× 60 to convert to seconds):
out = exploded.resample('H').count().mul(60).reset_index(name='seconds')
# init_date seconds
# 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
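Note that the minute-level date_range works here because all timestamps fall on whole minutes. If timestamps can carry seconds, the same idea works at second granularity; a sketch with two hypothetical trips (end dates already filled):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 8],
    "start_date": pd.to_datetime(["2022-03-10 01:20:00", "2022-03-10 05:30:30"]),
    "end_date": pd.to_datetime(["2022-03-10 01:32:00", "2022-03-10 05:48:30"]),
})

# one row per elapsed second instead of per minute
df["init_date"] = df.apply(
    lambda x: pd.date_range(x["start_date"], x["end_date"], freq="s", inclusive="left"),
    axis=1,
)
out = (df[["id", "init_date"]].explode("init_date")      # explode seconds into rows
       .set_index("init_date")["id"].resample("h").count()  # each row is one second
       .reset_index(name="seconds"))
# seconds per hour: 720, 0, 0, 0, 1080
```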
If I have a column with the driver id, how do I get a segmentation by hours and by driver id without reprocessing?
In this case, just change resample to groupby.resample: select driver_id before exploding, and group by driver_id before resampling.
As a minimal example, I duplicated the sample data to create two driver_id groups, a and b:
# after preprocessing and creating init_date ...
(df[['driver_id', 'init_date']] # now include driver_id
.explode('init_date').set_index('init_date') # explode minutes into rows
.groupby('driver_id').resample('H').count() # count rows (minutes) per hour per driver_id
.mul(60).rename(columns={'driver_id': 'seconds'})) # convert minutes to seconds
# seconds
# driver_id init_date
# a 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
# b 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
Upvotes: 3
Reputation: 1982
Here is what works in sqlite (can be tested):
CREATE TABLE trips(
id INT PRIMARY KEY NOT NULL,
start_date TIMESTAMP,
end_date TIMESTAMP,
seconds INT
);
INSERT INTO trips(id, start_date, end_date, seconds) VALUES
(1, '2022-03-10 01:20:00', '2022-03-10 01:32:00', 720),
(2, '2022-03-10 02:18:00', '2022-03-10 02:42:00', 1440),
(3, '2022-03-10 02:10:00', '2022-03-10 02:23:00', 780),
(4, '2022-03-10 02:40:00', '2022-03-10 03:20:00', 2400),
(5, '2022-03-10 02:45:00', '2022-03-10 02:58:00', 780),
(6, '2022-03-10 03:05:00', '2022-03-10 03:28:00', 1380),
(7, '2022-03-10 03:12:00', NULL, 0),
(8, '2022-03-10 05:30:00', '2022-03-10 05:48:00', 1080);
WITH
checked AS (SELECT '2022-03-10 03:00:00' AS start, '2022-03-10 04:00:00' AS end)
SELECT
SUM(
IIF(end_date IS NULL, ROUND(MAX(0, (JULIANDAY(checked.end) - JULIANDAY(start_date)) * 24 * 60 * 60)),
MAX(
0,
(JULIANDAY(MIN(checked.end, end_date)) - JULIANDAY(MAX(checked.start, start_date))) /
(JULIANDAY(end_date) - JULIANDAY(start_date)) * seconds
)
)
)
FROM trips, checked;
DROP TABLE trips;
The code is simplified and sqlite lacks some features, but I think it will be easy to adapt :)
Briefly, the algorithm is: for each finished trip, scale its seconds by the fraction of the trip that overlaps the checked interval; for an unfinished trip (null end_date), count the time from start_date up to the end of the interval. This can be done for any interval with a start and an end.
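The proportional step can be sketched in Python for trip 948pvc and the checked window 03:00-04:00 (illustrative values only):

```python
from datetime import datetime

checked_start = datetime(2022, 3, 10, 3, 0)
checked_end = datetime(2022, 3, 10, 4, 0)
trip_start = datetime(2022, 3, 10, 2, 40)
trip_end = datetime(2022, 3, 10, 3, 20)
trip_seconds = 2400

# overlap of the trip with the checked window, clamped at 0
overlap = (min(checked_end, trip_end) - max(checked_start, trip_start)).total_seconds()
overlap = max(0.0, overlap)

# scale the trip's seconds by the overlapping fraction: 1200/2400 of 2400 s
seconds_in_window = overlap / (trip_end - trip_start).total_seconds() * trip_seconds
```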
Upvotes: 0