Reputation: 418
I have a problem that I haven't been able to solve for a long time. I want to generate a new DataFrame with interpolated values (one row per minute) and N additional columns, where N is the number of unique values in the ID column of the base DataFrame.
My DataFrame (df) looks like this:
Datetime_ON Datetime_OFF ID
2020-01-02 15:21:49 2020-01-02 15:32:44 GB5U
2020-01-02 15:22:40 2020-01-02 15:24:14 GS8U
2020-01-02 15:45:25 2020-01-02 15:59:18 GS8U
2020-01-02 16:07:11 2020-01-02 16:16:50 GB5U
Now I want to create a new DataFrame (new_df) with a datetime range (one row per minute) from min(df['Datetime_ON']) to max(df['Datetime_OFF']):
new_df = pd.DataFrame(pd.date_range(start=min(df['Datetime_ON']), end=max(df['Datetime_OFF']), freq="1min"), columns=['Datetime'])
Datetime
2020-01-02 15:21:00
2020-01-02 15:22:00
2020-01-02 15:23:00
...
2020-01-02 16:16:00
Now I want to add N columns, one for each element of set(df['ID']) from the df DataFrame, and fill them with True/False (or 1/0) values based on whether the Datetime value in new_df falls between Datetime_ON and Datetime_OFF in any row of df whose ID equals the column name in new_df.
My expected output:
Datetime GB5U GS8U
2020-01-02 15:21:00 False False
2020-01-02 15:22:00 True False
2020-01-02 15:23:00 True True
2020-01-02 15:24:00 True True
2020-01-02 15:25:00 True False
...
2020-01-02 16:15:00 True False
2020-01-02 16:16:00 True False
The problem is that my DataFrame has over 700k rows, and the ID set contains more than 100 elements (so more than 100 new columns). I tried iterrows() and the apply function, but they are extremely slow. Is there a way to solve this problem in a reasonable time?
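For reference, a minimal sketch of the kind of iterrows()/apply approach I tried (names are illustrative; it assumes new_df's single column is named 'Datetime'):
for col in set(df['ID']):
    sub = df[df['ID'] == col]
    # For every minute, scan all intervals of this ID -- O(rows * intervals), hence slow
    new_df[col] = new_df['Datetime'].apply(
        lambda t: ((sub['Datetime_ON'] <= t) & (t <= sub['Datetime_OFF'])).any()
    )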
Upvotes: 0
Views: 400
Reputation: 11321
Here's a suggestion. I'm not sure if it's actually faster than what you've already tried, but maybe worth a try:
import pandas as pd
from functools import reduce
from operator import or_

# Just to make new_df as in the example
new_df = pd.DataFrame(pd.date_range('2020-01-02 15:21:00',
                                    '2020-01-02 16:16:00',
                                    freq="1min"),
                      columns=['Datetime'])

# To make the following statement more compact
df.rename(columns={'Datetime_ON': 'ON', 'Datetime_OFF': 'OFF'}, inplace=True)

new_df = pd.concat(
    [new_df]
    + [
        (reduce(or_, ((start <= new_df) & (new_df <= stop)
                      for start, stop in zip(group['ON'], group['OFF'])))
         .rename({'Datetime': key}, axis='columns'))
        for key, group in df.groupby('ID')
    ],
    axis='columns'
).set_index('Datetime', drop=True)
Result:
GB5U GS8U
Datetime
2020-01-02 15:21:00 False False
2020-01-02 15:22:00 True False
2020-01-02 15:23:00 True True
2020-01-02 15:24:00 True True
2020-01-02 15:25:00 True False
...
2020-01-02 16:15:00 True False
2020-01-02 16:16:00 True False
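In case the reduce(or_, ...) line is hard to read: for each ID group it simply OR-combines one boolean frame per (ON, OFF) interval. An equivalent explicit loop for a single group would look like this (a sketch of the same logic, not a faster variant):
masks = [(start <= new_df) & (new_df <= stop)
         for start, stop in zip(group['ON'], group['OFF'])]
combined = masks[0]
for m in masks[1:]:
    combined = combined | m   # same as reduce(or_, masks)
# 'combined' is a one-column boolean frame; renaming 'Datetime' to the
# group key turns it into that ID's column of the result.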
I thought a bit more about this problem and might have found a faster way. The idea is to set the True values more directly:
# new_df as in the example
new_df = pd.DataFrame(pd.date_range('2020-01-02 15:21:00',
                                    '2020-01-02 16:16:00',
                                    freq="1min"),
                      columns=['Datetime'])

# To make the following statement more compact
df.rename(columns={'Datetime_ON': 'ON', 'Datetime_OFF': 'OFF'}, inplace=True)

# To make sure the columns are in datetime format
df.ON = pd.to_datetime(df.ON)
df.OFF = pd.to_datetime(df.OFF)

idx = new_df.Datetime
dfs = []
for key, group in df.groupby('ID'):
    trues = [
        minute
        for start, stop in zip(group.ON, group.OFF)
        for minute in pd.date_range(start.ceil('min'), stop.floor('min'), freq='1min')
    ]
    df_group = pd.DataFrame({key: False}, index=idx)
    df_group.loc[trues, key] = True
    dfs.append(df_group)

new_df = pd.concat(dfs, axis='columns')
For the example the results are identical, and I think that should be the case in general, but I'm not yet 100% sure. Again, I'm also not sure about the efficiency, because a lot depends on the real data.
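If both variants turn out to be too slow on the full 700k rows, a NumPy-broadcasting sketch along the same lines might be worth trying (untested on real data; the intermediate boolean array per ID has shape minutes x intervals, so memory can become a concern for IDs with many rows):
import numpy as np
import pandas as pd

# Assumes df already has the renamed datetime columns ON / OFF and an ID column.
minutes = pd.date_range(df.ON.min().floor('min'), df.OFF.max().floor('min'), freq='1min')
t = minutes.values[:, None]                     # shape (M, 1)
cols = {}
for key, group in df.groupby('ID'):
    on = group.ON.values[None, :]               # shape (1, K)
    off = group.OFF.values[None, :]             # shape (1, K)
    # minute m is True if ON <= m <= OFF for any interval of this ID
    cols[key] = ((on <= t) & (t <= off)).any(axis=1)
result = pd.DataFrame(cols, index=minutes)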
Upvotes: 1