Reputation: 33
I am hoping to get a few suggestions for the best method for this iterative process I have on a large (~10m rows) pandas dataframe. I am using Python 3.7. The dataframe has a structure like so:
start_time flag
0 2019-12-01 15:31:42 True
1 2019-12-01 15:32:37 False
2 2019-12-01 15:47:58 False
3 2019-12-02 09:19:13 True
4 2019-12-02 09:21:12 False
5 2019-12-02 09:24:04 False
6 2019-12-02 11:26:31 False
7 2019-12-02 12:10:56 False
8 2019-12-02 12:25:13 False
9 2019-12-02 13:09:25 False
What I would like to do is the following:
Every time the flag is true, then assign new columns start_day
and start_hour
as the day and hour values from the column start_time
. Every time the flag is false, then assign start_day
and start_hour
to the value in the previous index.
So the desired result is:
start_datetime flag start_day start_hour
0 2019-12-01 15:31:42 True 1 15
1 2019-12-01 15:32:37 False 1 15
2 2019-12-01 15:47:58 False 1 15
3 2019-12-02 09:19:13 True 2 9
4 2019-12-02 09:21:12 False 2 9
5 2019-12-02 09:24:04 False 2 9
6 2019-12-02 11:26:31 False 2 9
7 2019-12-02 12:10:56 False 2 9
8 2019-12-02 12:25:13 False 2 9
9 2019-12-02 13:09:25 False 2 9
For additional information, the flag
column is around 90% False, but the gaps between True values vary significantly.
This is what I have tried so far:
def startBuilder(df):
df.loc[df['flag'], 'start_day'] = 1
df.loc[~df['flag'], 'start_day'] = 0
df['start_day'] = df['start_day'] * pd.DatetimeIndex(df['start_datetime']).day
df.loc[df['flag'], 'start_hour'] = 1
df.loc[~df['flag'], 'start_hour'] = 0
df['start_hour'] = df['start_hour'] * pd.DatetimeIndex(df['start_datetime']).hour
for i, data in df.iterrows():
if not data['flag']:
df.at[i, 'start_year'] = df.at[i - 1, 'start_year']
df.at[i, 'start_month'] = df.at[i - 1, 'start_month']
df.at[i, 'start_day'] = df.at[i - 1, 'start_day']
df.at[i, 'start_hour'] = df.at[i - 1, 'start_hour']
return df
Can I stream/chunk this data into this function? If so, what would be the best way?
Can this function be optimised to take advantage of the fact that flag
is 90% false?
I have tried the following, but it is extremely slow (about 9x slower than the above):
def startBuilder2(df):
df.loc[df['flag'], 'start_day'] = 1
df.loc[~df['flag'], 'start_day'] = 0
df['start_day'] = df['start_day'] * pd.DatetimeIndex(df['start_datetime']).day
df.loc[df['flag'], 'start_hour'] = 1
df.loc[~df['flag'], 'start_hour'] = 0
df['start_hour'] = df['start_hour'] * pd.DatetimeIndex(df['start_datetime']).hour
idx1 = df[df['flag']].index
idx_pair = zip(idx1[:-1], idx1[1:])
for pair in idx_pair:
df.at[(~df['flag']) & (df['flag'].index.to_series().between(pair[0], pair[1])), 'start_day'] = df.at[pair[0], 'start_day']
df.at[(~df['flag']) & (df['flag'].index.to_series().between(pair[0], pair[1])), 'start_hour'] = df.at[pair[0], 'start_hour']
return df
Apologies for the huge post, just want to provide as much context as possible. Thanks in advance.
Upvotes: 1
Views: 43
Reputation: 109546
You probably don't need to chunk. See if this works for you.
df['start_day'] = np.nan
df['start_hour'] = np.nan
mask = df['flag'].eq(True)
df.loc[mask, 'start_day'] = df.loc[mask, 'start_time'].dt.day
df.loc[mask, 'start_hour'] = df.loc[mask, 'start_time'].dt.hour
df.loc[:, ['start_day', 'start_hour']] = df.loc[:, ['start_day', 'start_hour']].ffill()
Upvotes: 1