Reputation: 679
Suppose I have the following dataframe:
df = pd.DataFrame({'a':[0,0,0,1,0,0], 'b':[0,0,1,0,0,0], 'c':[0,1,1,0,0,0]})
df.index = pd.date_range('2000-03-02', periods=6, freq='D')
which looks like this:
a b c
2000-03-02 0 0 0
2000-03-03 0 0 1
2000-03-04 0 1 1
2000-03-05 1 0 0
2000-03-06 0 0 0
2000-03-07 0 0 0
And now I want to set to 2 each value in a given column that occurs after the last 1. The desired result looks like this:
a b c
2000-03-02 0 0 0
2000-03-03 0 0 1
2000-03-04 0 1 1
2000-03-05 2 2 2
2000-03-06 2 2 2
2000-03-07 2 2 2
I've got this code, which works:
cols = df.columns
for col in cols:
s = df[col]
x = s[s==1].index[-1]
df[col][(x + 1):] = 2
But it seems pretty awkward and contrary to the spirit of pandas (un-Pandonic?). Any suggestions for better approaches?
Upvotes: 2
Views: 3291
Reputation: 375445
One way is to replace
the lower zeros with NaNs:
In [11]: df.replace(0, np.nan).bfill() # maybe neater way to do this?
Out[11]:
a b c
2000-03-02 1 1 1
2000-03-03 1 1 1
2000-03-04 1 1 1
2000-03-05 1 NaN NaN
2000-03-06 NaN NaN NaN
2000-03-07 NaN NaN NaN
Now you can use where
to change these to 2:
In [12]: df.where(df.replace(0, np.nan).bfill(), 2)
Out[12]:
a b c
2000-03-02 0 0 0
2000-03-03 0 0 1
2000-03-04 0 1 1
2000-03-05 1 2 2
2000-03-06 2 2 2
2000-03-07 2 2 2
Edit: it may be faster to use a trick here with cumsum:
In [21]: %timeit df.where(df.replace(0, np.nan).bfill(), 2)
100 loops, best of 3: 2.34 ms per loop
In [22]: %timeit df.where(df[::-1].cumsum()[::-1], 2)
1000 loops, best of 3: 1.7 ms per loop
In [23]: %timeit pd.DataFrame(np.where(np.cumsum(df.values[::-1], 0)[::-1], df.values, 2), df.index)
10000 loops, best of 3: 186 µs per loop
Upvotes: 1
Reputation: 128948
This is a pretty general soln. (e.g. your will fail if the index is non-consecutive). The first part, getting the indexer is pretty pandastic!
In [64]: indexer = Series(df.index.get_indexer(df.diff().idxmin().values),index=df.columns)
In [65]: indexer
Out[65]:
a 4
b 3
c 3
dtype: int64
I think their is a vectorized way of doing this, all you have to do is construct the correct boolean matrix based on the above indexer, but making my brain hurt.
In [66]: def f(x):
x.iloc[indexer[x.name]:] = 2
return x
....:
In [67]: df.apply(f)
Out[67]:
a b c
2000-03-02 0 0 0
2000-03-03 0 0 1
2000-03-04 0 1 1
2000-03-05 1 2 2
2000-03-06 2 2 2
2000-03-07 2 2 2
[6 rows x 3 columns]
Upvotes: 1