Reputation: 418
I have a problem that I haven't been able to solve for a long time. I want to generate a new DataFrame with interpolated values (one row per minute) and N additional columns, where N is the number of unique values in the ID column of the base DataFrame.
My DataFrame (df) looks like this:
Datetime_ON Datetime_OFF ID
2020-01-02 15:21:49 2020-01-02 15:32:44 GB5U
2020-01-02 15:22:40 2020-01-02 15:24:14 GS8U
2020-01-02 15:45:25 2020-01-02 15:59:18 GS8U
2020-01-02 16:07:11 2020-01-02 16:16:50 GB5U
Now I want to create a new DataFrame (new_df) with a datetime range (one row per minute) from min(df['Datetime_ON']) to max(df['Datetime_OFF']):
new_df = pd.DataFrame(pd.date_range(start=min(df['Datetime_ON']), end=max(df['Datetime_OFF']), freq="1min"), columns=['Datetime'])
Datetime
2020-01-02 15:21:00
2020-01-02 15:22:00
2020-01-02 15:23:00
...
2020-01-02 16:16:00
Now I want to add N columns, one for each element of set(df['ID']) from the df DataFrame, and fill them with True/False (or 1/0) values based on whether the Datetime value in new_df falls between Datetime_ON and Datetime_OFF in any row of df whose ID equals the column name in new_df.
My expected output:
Datetime GB5U GS8U
2020-01-02 15:21:00 False False
2020-01-02 15:22:00 True False
2020-01-02 15:23:00 True True
2020-01-02 15:24:00 True True
2020-01-02 15:25:00 True False
...
2020-01-02 16:15:00 True False
2020-01-02 16:16:00 True False
The problem is that my DataFrame has over 700k rows, and the ID set contains more than 100 elements (so more than 100 new columns). I tried iterrows() and the apply function, but they are extremely slow. Is there a way to solve this problem in a reasonable time?
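For reference, a minimal sketch of the kind of iterrows()/apply approach I tried (names are illustrative; it assumes new_df's single column is named 'Datetime'):
for col in set(df['ID']):
    sub = df[df['ID'] == col]
    # For every minute, scan all intervals of this ID -- O(rows * intervals), hence slow
    new_df[col] = new_df['Datetime'].apply(
        lambda t: ((sub['Datetime_ON'] <= t) & (t <= sub['Datetime_OFF'])).any()
    )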
Upvotes: 0
Views: 400
Reputation: 11321
Here's a suggestion. I'm not sure if it's actually faster than what you've already tried, but maybe worth a try:
import pandas as pd
from functools import reduce
from operator import or_

# Just to make new_df as in the example
new_df = pd.DataFrame(pd.date_range('2020-01-02 15:21:00',
                                    '2020-01-02 16:16:00',
                                    freq="1min"),
                      columns=['Datetime'])

# To make the following statement more compact
df.rename(columns={'Datetime_ON': 'ON', 'Datetime_OFF': 'OFF'}, inplace=True)

new_df = pd.concat(
    [new_df]
    + [
        (reduce(or_, ((start <= new_df) & (new_df <= stop)
                      for start, stop in zip(group['ON'], group['OFF'])))
         .rename({'Datetime': key}, axis='columns'))
        for key, group in df.groupby('ID')
    ],
    axis='columns'
).set_index('Datetime', drop=True)
Result:
GB5U GS8U
Datetime
2020-01-02 15:21:00 False False
2020-01-02 15:22:00 True False
2020-01-02 15:23:00 True True
2020-01-02 15:24:00 True True
2020-01-02 15:25:00 True False
...
2020-01-02 16:15:00 True False
2020-01-02 16:16:00 True False
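In case the reduce(or_, ...) line is hard to read: for each ID group it simply OR-combines one boolean frame per (ON, OFF) interval. An equivalent explicit loop for a single group would look like this (a sketch of the same logic, not a faster variant):
masks = [(start <= new_df) & (new_df <= stop)
         for start, stop in zip(group['ON'], group['OFF'])]
combined = masks[0]
for m in masks[1:]:
    combined = combined | m   # same as reduce(or_, masks)
# 'combined' is a one-column boolean frame; renaming 'Datetime' to the
# group key turns it into that ID's column of the result.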
I thought a bit more about this problem and might have found a faster way. The idea is to set the True values more directly:
# new_df as in the example
new_df = pd.DataFrame(pd.date_range('2020-01-02 15:21:00',
                                    '2020-01-02 16:16:00',
                                    freq="1min"),
                      columns=['Datetime'])

# To make the following statement more compact
df.rename(columns={'Datetime_ON': 'ON', 'Datetime_OFF': 'OFF'}, inplace=True)

# To make sure the columns are in datetime format
df.ON = pd.to_datetime(df.ON)
df.OFF = pd.to_datetime(df.OFF)

idx = new_df.Datetime
dfs = []
for key, group in df.groupby('ID'):
    trues = [
        minute
        for start, stop in zip(group.ON, group.OFF)
        for minute in pd.date_range(start.ceil('min'), stop.floor('min'), freq='1min')
    ]
    df_group = pd.DataFrame({key: False}, index=idx)
    df_group.loc[trues, key] = True
    dfs.append(df_group)

new_df = pd.concat(dfs, axis='columns')
For the example the results are identical, and I think that should be the case in general, but I'm not yet 100% sure. Again, I'm also not sure about the efficiency, because a lot depends on the real data.
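If both variants turn out to be too slow on the full 700k rows, a NumPy-broadcasting sketch along the same lines might be worth trying (untested on real data; the intermediate boolean array per ID has shape minutes x intervals, so memory can become a concern for IDs with many rows):
import numpy as np
import pandas as pd

# Assumes df already has the renamed datetime columns ON / OFF and an ID column.
minutes = pd.date_range(df.ON.min().floor('min'), df.OFF.max().floor('min'), freq='1min')
t = minutes.values[:, None]                     # shape (M, 1)
cols = {}
for key, group in df.groupby('ID'):
    on = group.ON.values[None, :]               # shape (1, K)
    off = group.OFF.values[None, :]             # shape (1, K)
    # minute m is True if ON <= m <= OFF for any interval of this ID
    cols[key] = ((on <= t) & (t <= off)).any(axis=1)
result = pd.DataFrame(cols, index=minutes)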
Upvotes: 1