Reputation: 1527
I have data about machine failures. The data is in a pandas dataframe with date
, id
, failure
and previous_30_days
columns. The previous_30_days
column is currently all zeros. My desired outcome is to populate rows in the previous_30_days
column with a '1' if they occur within a 30 day time-span before a failure. I am currently able to do this with the following code:
failure_df = df[(df['failure'] == 1)] # create a dataframe of just failures
for index, row in failure_df.iterrows():
df.loc[(df['date'] >= (row.date - datetime.timedelta(days=30))) &
(df['date'] <= row.date) & (df['id'] == row.id), 'previous_30_days'] = 1
Note that I also check for the id match, because dates are repeated in the dataframe, so I cannot assume it is simply the previous 30 rows.
My code works, but the problem is that the dataframe is millions of rows, and this code is too slow at the moment.
Is there a more efficient way to achieve the desired outcome? Any thoughts would be very much appreciated.
Upvotes: 1
Views: 77
Reputation: 30424
I'm a little confused about how your code works (or is supposed to work), but this ought to point you in the right direction and can be easily adapted. It will be much faster by avoiding iterrows
in favor of vectorized operations (about 7x faster for this small dataframe, it should be a much bigger improvement on your large dataframe).
np.random.seed(123)
df=pd.DataFrame({ 'date':np.random.choice(pd.date_range('2015-1-1',periods=300),20),
'id':np.random.randint(1,4,20) })
df=df.sort(['id','date'])
Now, calculate days between current and previous date (by id).
df['since_last'] = df.groupby('id')['date'].apply( lambda x: x - x.shift() )
Then create your new column based on the number of days to the previous date.
df['previous_30_days'] = df['since_last'] < datetime.timedelta(days=30)
date id since_last previous_30_days
12 2015-02-17 1 NaT False
6 2015-02-27 1 10 days True
3 2015-03-25 1 26 days True
0 2015-04-09 1 15 days True
10 2015-04-24 1 15 days True
5 2015-05-04 1 10 days True
11 2015-05-07 1 3 days True
8 2015-08-14 1 99 days False
14 2015-02-02 2 NaT False
9 2015-04-07 2 64 days False
19 2015-07-28 2 112 days False
7 2015-08-03 2 6 days True
15 2015-08-13 2 10 days True
1 2015-08-19 2 6 days True
2 2015-01-18 3 NaT False
13 2015-03-15 3 56 days False
18 2015-04-07 3 23 days True
4 2015-04-17 3 10 days True
16 2015-04-22 3 5 days True
17 2015-09-11 3 142 days False
Upvotes: 1