Gustav Fjorder
Gustav Fjorder

Reputation: 31

Is there a faster way to iterate through a DataFrame?

I am going through a Pandas DataFrame of time slots, trying to compare each time slot to the other slots of the day to find double bookings.

The script takes a while to run. Is there a faster way to do this?

df_temp = pd.DataFrame()
for date in df_cal["date"].unique():
    df_date = df_cal[df_cal["date"]==date]
    for current in range(len(df_date)):
        for comp in range(current+1,df_date[df_date["Start"]<df_date.iloc[current]["End"]]["Start"].idxmax()+1):
            df_date.loc[comp,"Double booked"] = True
            df_date.loc[current,"Double booked"] = True
            df_date.loc[comp,"Time_removed"] = max(df_date.loc[comp,"Time_removed"],pd.Timedelta(min(df_date.iloc[current]["End"] - df_date.iloc[comp]["Start"],\
                                                           df_date.iloc[comp]["End"] - df_date.iloc[comp]["Start"])))

    df_temp = pd.concat([df_temp,df_date])

Columns are [["MEET_ID","date","Start","End","double_booked","Time_removed"]]

[[1943,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 09:00:00'),
  Timestamp('2017-05-01 09:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1907,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 10:00:00'),
  Timestamp('2017-05-01 11:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1913,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 11:00:00'),
  Timestamp('2017-05-01 12:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1956,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 12:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1905,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1914,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1940,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 13:00:00'),
  Timestamp('2017-05-01 16:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1958,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 14:30:00'),
  Timestamp('2017-05-01 15:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1892,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 16:00:00'),
  Timestamp('2017-05-01 16:30:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1929,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 16:30:00'),
  Timestamp('2017-05-01 17:00:00'),
  False,
  Timedelta('0 days 00:00:00')],
 [1887,
  Timestamp('2017-05-01 00:00:00'),
  Timestamp('2017-05-01 17:30:00'),
  Timestamp('2017-05-01 18:00:00'),
  False,
  Timedelta('0 days 00:00:00')]]

Which should then yield something like this, where double booked meetings are marked as such and the overlapping time is removed from one of the meetings (here it is removed from the second one) Columns are [["MEET_ID","Start","End","Time_removed","double_booked"]]

[[1943,
  Timestamp('2017-05-01 09:00:00'),
  Timestamp('2017-05-01 09:30:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1907,
  Timestamp('2017-05-01 10:00:00'),
  Timestamp('2017-05-01 11:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1913,
  Timestamp('2017-05-01 11:00:00'),
  Timestamp('2017-05-01 12:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1956,
  Timestamp('2017-05-01 12:00:00'),
  Timestamp('2017-05-01 12:30:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1905,
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1914,
  Timestamp('2017-05-01 12:30:00'),
  Timestamp('2017-05-01 13:00:00'),
  Timedelta('0 days 00:30:00'),
  True],
 [1940,
  Timestamp('2017-05-01 13:00:00'),
  Timestamp('2017-05-01 16:00:00'),
  Timedelta('0 days 00:00:00'),
  True],
 [1958,
  Timestamp('2017-05-01 14:30:00'),
  Timestamp('2017-05-01 15:30:00'),
  Timedelta('0 days 01:00:00'),
  True],
 [1892,
  Timestamp('2017-05-01 16:00:00'),
  Timestamp('2017-05-01 16:30:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1929,
  Timestamp('2017-05-01 16:30:00'),
  Timestamp('2017-05-01 17:00:00'),
  Timedelta('0 days 00:00:00'),
  False],
 [1887,
  Timestamp('2017-05-01 17:30:00'),
  Timestamp('2017-05-01 18:00:00'),
  Timedelta('0 days 00:00:00'),
  False]]

Edit new data 09/07/2018:

    Start               End                 Time_removed  Double booked
77  2018-07-02 00:00:00 2018-07-02 10:00:00 00:00:00      True
78  2018-07-02 03:00:00 2018-07-02 08:00:00 05:00:00      True
79  2018-07-02 03:00:00 2018-07-02 08:00:00 05:00:00      True
80  2018-07-02 04:30:00 2018-07-02 09:30:00 03:30:00      True
81  2018-07-02 05:00:00 2018-07-02 10:00:00 04:30:00      True
82  2018-07-02 05:00:00 2018-07-02 10:00:00 05:00:00      True

Row 80 should remove 5 hours but only removes 3:30 because it compares to the one row before it. It must have previously computed Time_removed between row 77 and 80 but then it gets replaced by a smaller timediff.

Upvotes: 3

Views: 1343

Answers (1)

a_guest
a_guest

Reputation: 36239

Looks like a job for DataFrame.groupby. You can also use numpy's outer product for eliminating the inner double for loop.

def process_data(df):
    pos = np.argwhere(np.less.outer(df['start'], df['end']))
    indices = df.index[pos]
    unique = indices.ravel().unique()
    date_diff = np.subtract.outer(df['end'], df['start']).max(axis=0)
    return pd.DataFrame(
        data=np.asarray([
            [True]*len(indices),
            np.where(
                np.isin(unique, indices[:, 1]),
                date_diff,
                np.NaN
            )
        ]).T,
        columns=['Double booked', 'Time_removed'],
        index=unique
    )

df_cal.groupby('date').apply(process_data)

In any case, this is based on the snippet of the OP only, and without any example data frame and example output (i.e. some kind of unit test), it is hard to say whether it really solves the issue.

Also you have to make sure to not confuse index and position. In your question you seem to mix .loc and .iloc as well as usage of range. I'm not sure this is giving the result you want.

Edit

From the data added to OP it looks like the 'Date' variable is actually dependent on the 'Start' variable (namely being just the date of the 'Start' datetime value). Given that this holds we can spare the application of groupby and directly apply the outer product in order to get the overlapping items:

overlapping = np.less_equal.outer(df['Start'], df['Start']) & np.greater.outer(df['End'], df['Start'])
overlapping &= ~np.identity(len(df), dtype=bool)  # Meetings are overlapping with themselves; need to remove.
overlapping_indices = df.index[np.argwhere(overlapping)].values

df.loc[
    np.unique(overlapping_indices.ravel()),
    'double_booked'
] = True

df.loc[
    overlapping_indices[:, 1],
    'Time_removed'
] = (
    np.minimum(df.loc[overlapping_indices[:, 0], 'End'], df.loc[overlapping_indices[:, 1], 'End'])
    - np.maximum(df.loc[overlapping_indices[:, 0], 'Start'], df.loc[overlapping_indices[:, 1], 'Start'])
).values

However from the example data it is not clear how you want to deal with marking overlapping meetings as double booked. For the 12:30:00 - 13:00:00 meetings you marked only the second one, while for the 13:00:00 - 16:00:00 and 14:30:00 - 15:30:00 you marked both as double booked.

Edit 2

In order to consider multiple (> 3) overlapping meetings we need to compute the overlap times for all pairs of meetings and then consider the maximum overlap for those that actually have a (positive) overlap. The following solution requires the data to be sorted by start time:

# This requires the data frame to be sorted by `Start` time.

start_times = np.tile(df['Start'].values, (len(df), 1))
end_times = np.tile(df['End'].values, (len(df), 1))
overlap_times = np.triu(np.minimum(end_times, end_times.T) - np.maximum(start_times, start_times.T))
overlap_times[np.diag_indices(len(overlap_times))] = np.timedelta64(0)
overlap_indices = df.index[np.argwhere(overlap_times > np.timedelta64(0))]
overlaps_others_indices = np.unique(overlap_indices[:, 1])

df.loc[
    np.unique(overlap_indices.ravel()),
    'double_booked'
] = True

df.loc[
    overlaps_others_indices,
    'Time_removed'
] = pd.Series(overlap_times.max(axis=0), index=df.index)[overlaps_others_indices]

Upvotes: 1

Related Questions