Reputation: 1723
I'm looking for the best way of solving the following problem in Pandas. Given a dataframe df, e.g.
import pandas as pd
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A':[0.,0.,0.,1.,0.,0.,1.,0.,0.,0.],'B':[0.,0.,0.,0.,1.,0.,0.,0.,0.,1.]},index=dt)
df
A B
2014-01-01 00:00:00 0 0
2014-01-01 01:00:00 0 0
2014-01-01 02:00:00 0 0
2014-01-01 03:00:00 1 0
2014-01-01 04:00:00 0 1
2014-01-01 05:00:00 0 0
2014-01-01 06:00:00 1 0
2014-01-01 07:00:00 0 0
2014-01-01 08:00:00 0 0
2014-01-01 09:00:00 0 1
then for each row where B=1, I would like to find the index of the last row before it where A=1, i.e.
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
I'm not sure what the easiest way is to do this in Pandas while also being reasonably efficient for large dataframes.
Update:
I should have made clear in my original question that I'd like this to work when there is potentially more than one A=1 per B=1.
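For reference, one vectorized sketch of the idea (my own rough attempt, assuming it's acceptable to forward-fill the last A == 1 timestamp and then keep only the B == 1 rows):

```python
import pandas as pd

# Rebuild the example frame from the question.
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A': [0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
                   'B': [0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]},
                  index=dt)

# Index value wherever A == 1, NaT elsewhere, then forward-filled:
# each row now carries the timestamp of the last A == 1 at or before it.
last_a = pd.Series(df.index.where(df['A'] == 1), index=df.index).ffill()

# Keep only the B == 1 rows and attach the forward-filled timestamps.
out = df.loc[df['B'] == 1].copy()
out['C'] = last_a
```

This stays vectorized, so it should also behave reasonably on large frames.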
Upvotes: 0
Views: 472
Reputation: 1396
This is what I would do; it's easy to understand and readable:
import numpy as np
import pandas as pd
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A':[0.,0.,0.,1.,0.,0.,1.,0.,0.,0.],
'B':[0.,0.,0.,0.,1.,0.,0.,0.,0.,1.]},
index=dt)
# How I would do it
temp = np.where(df['B']==1)[0]
newdf = df.iloc[temp].copy()  # .copy() avoids a SettingWithCopyWarning below
temp2 = np.where(df['A']==1)[0]
newdf['C'] = df.iloc[temp2].index
which prints
In [8]: newdf
Out[8]:
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
Upvotes: 0
Reputation: 16488
Notice that the answer from @jeanrjc is not stable if you have more than one A == 1 row per B == 1 group. To show this, I modify the given dataframe:
In[217]: df.loc[df.index[2], 'A'] = 1
In[218]: df
Out[218]:
A B
2014-01-01 00:00:00 0 0
2014-01-01 01:00:00 0 0
2014-01-01 02:00:00 1 0
2014-01-01 03:00:00 1 0
2014-01-01 04:00:00 0 1
2014-01-01 05:00:00 0 0
2014-01-01 06:00:00 1 0
2014-01-01 07:00:00 0 0
2014-01-01 08:00:00 0 0
2014-01-01 09:00:00 0 1
That answer takes the timestamps from wherever df.A == 1; if the number of those rows no longer matches the number of B == 1 rows, you get an error:
In[219]: new_df["C"] = df.index[np.where(df.A==1)[0]]
Traceback (...)
ValueError: Length of values does not match length of index
Instead, I build a list that collects the correct index for each row. If you want to understand what happens, read it from the outside in: for each b where B == 1, df.loc[:b] is everything up to and including b, the inner selection keeps the rows where A == 1, and .iloc[-1].name is the index of the last of those.
In[239]: new_df['C'] = [df.loc[:b][df.loc[:b].A == 1].iloc[-1].name for b in df.loc[df.B == 1].index ]
In[240]: new_df
Out[240]:
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
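For larger frames, the same result on this modified dataframe can be sketched with pd.merge_asof, a backward as-of join that picks, for each B == 1 row, the last A == 1 timestamp at or before it (variable names here are my own):

```python
import pandas as pd

# Rebuild the modified frame from above (A == 1 at 02:00 as well).
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A': [0., 0., 1., 1., 0., 0., 1., 0., 0., 0.],
                   'B': [0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]},
                  index=dt)

# Left side: the B == 1 rows; right side: the A == 1 timestamps.
b_rows = df.loc[df['B'] == 1].reset_index()
a_times = pd.DataFrame({'index': df.index[df['A'] == 1],
                        'C': df.index[df['A'] == 1]})

# Backward as-of join: each B row gets the last earlier A timestamp,
# regardless of how many A == 1 rows fall in each group.
new_df = pd.merge_asof(b_rows, a_times, on='index').set_index('index')
```

Both keys are already sorted (merge_asof requires that), so this avoids the per-row slicing of the list comprehension above.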
Upvotes: 1
Reputation: 21873
I would do it like this:
import numpy as np

new_df = df[df.B==1].copy()  # .copy() avoids a SettingWithCopyWarning
new_df["C"] = df.index[np.where(df.A==1)[0]]
new_df
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
Upvotes: 0
Reputation: 52256
Here's one idea; there may be something better.
Populate a column C
with index values where A
is 1.
df['C'] = np.where(df['A'] == 1, df.index, pd.NaT)
Forward fill the missing values, so each row holds the last time A
was 1.
df['C'] = df['C'].ffill()
Finally, set the values where B
is not 1 back to missing.
df['C'] = np.where(df['B'] == 1, df['C'], pd.NaT)
In [108]: df
Out[108]:
A B C
2014-01-01 00:00:00 0 0 NaT
2014-01-01 01:00:00 0 0 NaT
2014-01-01 02:00:00 0 0 NaT
2014-01-01 03:00:00 1 0 NaT
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 05:00:00 0 0 NaT
2014-01-01 06:00:00 1 0 NaT
2014-01-01 07:00:00 0 0 NaT
2014-01-01 08:00:00 0 0 NaT
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
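Put together as one runnable sketch (a restatement of the three steps above, not new logic; I use df.index.to_series().where(...) and Series.where so the column keeps a datetime dtype):

```python
import pandas as pd

dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A': [0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
                   'B': [0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]},
                  index=dt)

# Step 1: index values where A is 1, NaT elsewhere.
df['C'] = df.index.to_series().where(df['A'] == 1)
# Step 2: forward fill, so each row holds the last time A was 1.
df['C'] = df['C'].ffill()
# Step 3: set rows where B is not 1 back to NaT.
df['C'] = df['C'].where(df['B'] == 1)
```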
Upvotes: 1