MarkD

Reputation: 4954

pandas - Splitting date ranges on specific day boundary

I've got a DataFrame of date ranges (the actual DataFrame has more data attached to it but has the same start and end columns). The data ultimately needs to be analyzed week-by-week on a Sunday-Saturday basis. Thus, I'd like to go through the DataFrame, and split any date ranges (start to finish) that cross from a Saturday to Sunday. For example, given the DataFrame:

import pandas as pd

date_ranges = [
    {'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
    {'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
    {'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
    {'start': '2020-01-18 22:30:00', 'end': '2020-01-19 02:00:00'}  # spans sat-sun, must split
]
data_df = pd.DataFrame(date_ranges)

I want my result to look like:

result_ranges = [
    {'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
    {'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
    {'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
    {'start': '2020-01-18 22:30:00', 'end': '2020-01-19 00:00:00'}, # split out saturday portion
    {'start': '2020-01-19 00:00:00', 'end': '2020-01-19 02:00:00'}  # and the sunday portion
]

result_df = pd.DataFrame(result_ranges)

Any thoughts on how to effectively do this in pandas would be greatly appreciated. Currently I am doing the bad thing, and iterating over rows, and it is quite slow when the data set gets large.

Upvotes: 5

Views: 2583

Answers (3)

filbranden

Reputation: 8898

Similar to @Valdi_Bo's answer, I looked into breaking down a single (start, end) interval into a series of intervals, split at every Sunday midnight in between.

This is accomplished by the following function:

def break_weekly(start, end):
    # Sunday midnights inside the interval: 'W' anchors on Sundays,
    # normalize=True snaps start/end to midnight before generating,
    # and closed='right' excludes the left boundary.
    edges = list(pd.date_range(start, end, freq='W', normalize=True, closed='right'))
    # Corner case: if `end` is exactly a Sunday midnight, drop that edge
    # so we don't emit an empty (end, end) interval.
    if edges and edges[-1] == end:
        edges.pop()
    # Pair consecutive boundaries: (start, e1), (e1, e2), ..., (en, end).
    return pd.Series(list(zip([start] + edges, edges + [end])))
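A quick version note: the closed= argument to pd.date_range was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same list of edges comes from inclusive='right':

edges = list(pd.date_range(start, end, freq='W', normalize=True, inclusive='right'))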

This code creates a weekly date range from "start" to "end": freq='W' anchors on Sundays and normalize=True snaps the generated edges to midnight (so each edge is a Sunday at midnight), while closed='right' keeps the range open on the left, so the first edge is the Sunday midnight strictly after "start".

There's a corner case for when "end" is exactly midnight on a Sunday: since each interval should only be closed on one side, and we're keeping it closed on the right, we check whether the last edge matches "end" and drop it if they're the same.

We then use zip() to create a tuple from each pair of consecutive boundaries, prepending the "start" timestamp on the left and appending the "end" timestamp on the right.

We finally return a pd.Series of those tuples, since that makes apply() do what we expect.

Example usage:

>>> break_weekly(pd.Timestamp('2020-01-18 22:30:00'), pd.Timestamp('2020-01-19 02:00:00'))
0    (2020-01-18 22:30:00, 2020-01-19 00:00:00)
1    (2020-01-19 00:00:00, 2020-01-19 02:00:00)
dtype: object
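To see the corner case handled above, here is the same call with an "end" falling exactly on a Sunday midnight; the midnight edge is dropped and the interval comes back whole:

>>> break_weekly(pd.Timestamp('2020-01-18 22:30:00'), pd.Timestamp('2020-01-19 00:00:00'))
0    (2020-01-18 22:30:00, 2020-01-19 00:00:00)
dtype: object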

At this point, you can apply it to the original data frame to find the complete list of intervals.

First, convert the types of the columns to pd.Timestamp (you have strings in the columns in your example):

data_df = data_df.apply(pd.to_datetime)

Then you can find the whole list of intervals with:

intervals = (data_df
    .apply(lambda r: break_weekly(r.start, r.end), axis=1)
    .unstack().dropna().reset_index(level=0, drop=True)
    .apply(lambda r: pd.Series(r, index=['start', 'end'])))

The first step applies break_weekly() to the "start" and "end" columns, row by row. Since break_weekly() returns a pd.Series, the apply produces a new DataFrame with one column per weekly piece (as many columns as the largest number of pieces in any row).

Then unstack() merges those columns back together into a single Series, and dropna() drops the NaNs that were generated because rows have different numbers of pieces.

At this point we have a MultiIndex, so reset_index(level=0, drop=True) drops the index level we don't care about (the piece number) and keeps only the one that matches the original DataFrame.

Finally, the last apply() converts the entries from Python tuples back to a pd.Series and names the columns "start" and "end" again.
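For intuition, here is the same chain unrolled step by step (a sketch; the intermediate names wide and flat are only for illustration):

# One column per weekly piece; rows with fewer pieces are padded with NaN.
wide = data_df.apply(lambda r: break_weekly(r.start, r.end), axis=1)

# Collapse the columns into a Series of (start, end) tuples, dropping the padding.
flat = wide.unstack().dropna()

# Drop the piece-number index level so the index matches the original rows.
flat = flat.reset_index(level=0, drop=True)

# Expand each tuple into named 'start'/'end' columns.
intervals = flat.apply(lambda t: pd.Series(t, index=['start', 'end']))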

Looking at the result up until this point:

>>> intervals
                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00

Since the indices match the ones from your original DataFrame, you can use them to connect this result back to the original. If you had more columns with values there and you want to duplicate those here, it's just a matter of joining the two DataFrames together.

For example:

>>> data_df['value'] = ['abc', 'def', 'ghi', 'jkl']
>>> intervals.join(data_df.drop(['start', 'end'], axis=1))
                start                 end value
0 2020-01-16 22:30:00 2020-01-17 01:00:00   abc
1 2020-01-17 04:30:00 2020-01-17 12:30:00   def
2 2020-01-18 10:15:00 2020-01-18 14:00:00   ghi
3 2020-01-18 22:30:00 2020-01-19 00:00:00   jkl
3 2020-01-19 00:00:00 2020-01-19 02:00:00   jkl

You'll notice that the value in the last row has been copied to both rows in that interval.

Upvotes: 0

ALollz

Reputation: 59579

Manipulations like this are always difficult and at some level I think a loop is necessary. In this case, instead of looping over the rows, we can loop over the edges. This should lead to a rather big gain in performance when the number of weeks your data spans is much smaller than the number of rows you have.

We define edges and modify the DataFrame endpoints where necessary. In the end, the desired DataFrame is whatever is left of the modified DataFrame, plus all the separate timespans we stored in l. The original index is preserved, so you can see exactly which rows were split. If a single timespan straddles N edges, it gets split into N+1 separate rows (see the sketch after the output below).

Setup

import pandas as pd

df = data_df.copy()  # work on a copy of the question's DataFrame
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

# Sunday-midnight edges, padded a week on each side so every span is bracketed
edges = pd.date_range(df.start.min().normalize() - pd.Timedelta(days=7),
                      df.end.max().normalize() + pd.Timedelta(days=7), freq='W-SUN')

Code

l = []
for edge in edges:
    m = df.start.lt(edge) & df.end.gt(edge)  # Rows to modify
    l.append(df.loc[m].assign(end=edge))     # Clip end of modified rows
    df.loc[m, 'start'] = edge                # Fix start for next edge

result = pd.concat(l + [df]).sort_values('start')

Output

                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00
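To check the "N edges, N+1 rows" claim from above, here is a quick sketch with a span crossing two Sunday midnights (the Jan 25-Feb 2 range is an invented test case):

df2 = pd.DataFrame({'start': [pd.Timestamp('2020-01-25 20:30:00')],
                    'end':   [pd.Timestamp('2020-02-02 03:00:00')]})
edges2 = pd.date_range(df2.start.min().normalize() - pd.Timedelta(days=7),
                       df2.end.max().normalize() + pd.Timedelta(days=7), freq='W-SUN')
pieces = []
for edge in edges2:
    m = df2.start.lt(edge) & df2.end.gt(edge)
    pieces.append(df2.loc[m].assign(end=edge))
    df2.loc[m, 'start'] = edge
print(pd.concat(pieces + [df2]).sort_values('start'))
# Three rows: Sat 20:30 -> Sun 00:00, one full week, then Sun 00:00 -> 03:00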

Upvotes: 8

Valdi_Bo

Reputation: 31011

My solution is even more general than what you asked for: it creates a sequence of "week rows" from each source row, even if the two dates have, e.g., two Sat/Sun breaks between them.

To check that it works, I added one such row to your DataFrame, so that it contains:

                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 02:00:00
4 2020-01-25 20:30:00 2020-02-02 03:00:00

Note that the last row includes two Sat/Sun breaks, from Jan 25 to Jan 26 and from Feb 1 to Feb 2.

Start from conversion of both columns to datetime:

data_df.start = pd.to_datetime(data_df.start)
data_df.end = pd.to_datetime(data_df.end)

To process your data, define the following function, to be applied to each row:

def weekRows(row):
    # Index the two timestamps by themselves so the Series can be resampled.
    row.index = pd.DatetimeIndex(row)
    # Weekly bins ending on Sunday; closed='left' sends an exact Sunday
    # midnight into the following week.
    gr = row.resample('W-SUN', closed='left')
    ngr = gr.ngroups  # Number of groups (weeks spanned)
    i = 1
    data = []
    for key, grp in gr:
        # Each bin is labelled (key) by its ending Sunday midnight.
        dt1 = key - pd.Timedelta('7D')
        dt2 = key
        if i == 1:    # First piece starts at the original start
            dt1 = row.iloc[0]
        if i == ngr:  # Last piece ends at the original end
            dt2 = row.iloc[1]
        data.append([dt1, dt2])
        i += 1
    return pd.DataFrame(data, columns=['start', 'end'])

Let's see how it operates on the last two rows, individually:

When you run:

row = data_df.loc[3]
weekRows(row)

(for the last but one row), you will get:

                start                 end
0 2020-01-18 22:30:00 2020-01-19 00:00:00
1 2020-01-19 00:00:00 2020-01-19 02:00:00

And when you run:

row = data_df.loc[4]
weekRows(row)

(for the last), you will get:

                start                 end
0 2020-01-25 20:30:00 2020-01-26 00:00:00
1 2020-01-26 00:00:00 2020-02-02 00:00:00
2 2020-02-02 00:00:00 2020-02-02 03:00:00

And to get your desired result, run:

result = pd.concat(data_df.apply(weekRows, axis=1).values, ignore_index=True)
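A side note on the mechanics, as a quick sketch: apply() with axis=1 here returns a Series whose elements are the per-row DataFrames, and .values extracts them as a sequence for pd.concat:

>>> per_row = data_df.apply(weekRows, axis=1)
>>> type(per_row)
<class 'pandas.core.series.Series'>
>>> type(per_row.iloc[0])
<class 'pandas.core.frame.DataFrame'>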

The result is:

                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
4 2020-01-19 00:00:00 2020-01-19 02:00:00
5 2020-01-25 20:30:00 2020-01-26 00:00:00
6 2020-01-26 00:00:00 2020-02-02 00:00:00
7 2020-02-02 00:00:00 2020-02-02 03:00:00

The first 3 rows result from your first 3 source rows. The next two rows (index 3 and 4) result from the source row with index 3. And the last 3 rows (index 5 through 7) result from the last source row.

Upvotes: 3
