Reputation: 735
I have a DataFrame as below.
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0], 'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})
col1 col2
0 0 0
1 0 0
2 1 1
3 0 2
4 0 3
5 0 0
6 1 0
7 2 0
8 0 0
For each column, I want to find the index of the value 1 that comes before the column's maximum. For example, in the first column the max is 2, and the index of the 1 before it is 6; in the second column the max is 3, and the index of the 1 before it is 2.
In summary, I am looking to get [6, 2] as the output for this test DataFrame. Is there a quick way to achieve this?
Upvotes: 6
Views: 1129
Reputation: 294546
import numpy as np

t = test.to_numpy()
a = t.argmax(0)          # row position of each column's max
i, j = np.where(t == 1)  # row/column positions of every 1
mask = i <= a[j]         # keep only 1s at or before the column's max row
i = i[mask]
j = j[mask]
b = np.empty_like(a)
b.fill(-1)               # -1 marks columns with no qualifying 1
np.maximum.at(b, j, i)   # per column, take the last qualifying row
pd.Series(b, test.columns)
col1 6
col2 2
dtype: int64
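For reference, np.maximum.at applies an unbuffered, in-place maximum at the given indices, so repeated column indices all take effect; that is what collapses the surviving (row, column) pairs of 1s down to the last row per column. A minimal sketch, with values chosen to mirror the result above:
import numpy as np

b = np.array([-1, -1])
np.maximum.at(b, [0, 0, 1], [2, 6, 2])  # b[0] = max(-1, 2, 6); b[1] = max(-1, 2)
print(b)  # [6 2]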
apply
test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, x <= s.idxmax(), x)))
col1 6
col2 2
dtype: int64
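To see why the tuple key picks index 6 for col1, compare a few candidate keys by hand (a sketch; Python compares tuples lexicographically, so "value is 1" dominates, then "at or before the argmax", then the index breaks ties):
s = test['col1']
print((s[6] == 1, 6 <= s.idxmax(), 6))  # (True, True, 6), the winning key
print((s[2] == 1, 2 <= s.idxmax(), 2))  # (True, True, 2)
print((s[7] == 1, 7 <= s.idxmax(), 7))  # (False, True, 7), the max itself loses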
cummax
test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
col1 6
col2 2
dtype: int64
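The intermediate mask makes the trick visible (a sketch): everything from the running max onward turns into NaN, so reversing and calling idxmax lands on the last surviving True.
print(test.eq(1).where(test.cummax().lt(test.max())))
# col1 keeps True at rows 2 and 6; col2 keeps True at row 2; the rest is False or NaN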
I just wanted to use a new tool and do some benchmarking; see this post.
Timings are in seconds (smaller is better); columns are the generated DataFrame lengths:
r.to_pandas_dataframe().T
10 31 100 316 1000 3162 10000
al_0 0.003696 0.003718 0.005512 0.006210 0.010973 0.007764 0.012008
wb_0 0.003348 0.003334 0.003913 0.003935 0.004583 0.004757 0.006096
qh_0 0.002279 0.002265 0.002571 0.002643 0.002927 0.003070 0.003987
sb_0 0.002235 0.002246 0.003072 0.003357 0.004136 0.004083 0.005286
sb_1 0.001771 0.001779 0.002331 0.002353 0.002914 0.002936 0.003619
cs_0 0.005742 0.005751 0.006748 0.006808 0.007845 0.008088 0.009898
cs_1 0.004034 0.004045 0.004871 0.004898 0.005769 0.005997 0.007338
pr_0 0.002484 0.006142 0.027101 0.085944 0.374629 1.292556 6.220875
pr_1 0.003388 0.003414 0.003981 0.004027 0.004658 0.004929 0.006390
pr_2 0.000087 0.000088 0.000089 0.000093 0.000107 0.000145 0.000300
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 10))
ax = plt.subplot()
r.plot(ax=ax)
import math

import numpy as np
import pandas as pd
from simple_benchmark import BenchmarkBuilder

b = BenchmarkBuilder()
def al_0(test): return test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
def wb_0(df): return (df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
def qh_0(test): return (test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()
def sb_0(test): return test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
def sb_1(test): return test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
def cs_0(test): return (lambda m: test.mask(m).apply(pd.Series.last_valid_index))(test.eq(test.max()).cumsum().gt(0) | test.ne(1))
def cs_1(test): return pd.Series((test.eq(1) & test.eq(test.max()).cumsum().eq(0)).values.cumsum(axis=0).argmax(axis=0), test.columns)
def pr_0(test): return test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, x <= s.idxmax(), x)))
def pr_1(test): return test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
def pr_2(test):
    t = test.to_numpy()
    a = t.argmax(0)
    i, j = np.where(t == 1)
    mask = i <= a[j]
    i = i[mask]
    j = j[mask]
    b = np.empty_like(a)
    b.fill(-1)
    np.maximum.at(b, j, i)
    return pd.Series(b, test.columns)
def gen_test(n):
    # Random data with ~log10(n) columns; retry until no column's max
    # sits in row 0, then plant a 1 strictly before each column's argmax.
    a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
    idx = a.argmax(0)
    while (idx == 0).any():
        a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
        idx = a.argmax(0)
    for j, i in enumerate(idx):
        a[np.random.randint(i), j] = 1
    return pd.DataFrame(a).add_prefix('col')
@b.add_arguments('DataFrame Size')
def argument_provider():
    # Seven sizes, 10 through 10000, matching the columns of the timing table above.
    for exponent in np.linspace(1, 4, 7):
        size = int(10 ** exponent)
        yield size, gen_test(size)
b.add_functions([al_0, wb_0, qh_0, sb_0, sb_1, cs_0, cs_1, pr_0, pr_1, pr_2])
r = b.run()
Upvotes: 4
Reputation: 25269
I would use where to keep only the 1s up to each column's max, drop_duplicates keeping the last 1, and call idxmin on the result.
test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
Out[1433]:
col1 6
col2 2
dtype: int64
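The drop_duplicates step is the subtle part: after where, the column is mostly NaN, NaNs count as duplicates of one another, so only the last NaN and the last 1 survive, and idxmin then skips the NaN. A step-by-step sketch on col1:
x = test['col1']
step1 = x.where(x[:x.idxmax()].eq(1))       # 1.0 at indices 2 and 6, NaN elsewhere
step2 = step1.drop_duplicates(keep='last')  # keeps the last 1 (index 6) and the last NaN (index 8)
print(step2.idxmin())                       # 6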
Upvotes: 1
Reputation: 323396
A little bit of logic here:
(test.iloc[::-1].cummax().eq(test.max()) & test.eq(1).iloc[::-1]).idxmax()
Out[187]:
col1 6
col2 2
dtype: int64
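Unpacking that logic (a sketch): on the reversed frame, the running max reaches the column max exactly at the original positions at or before the argmax, so ANDing with the reversed 1s mask keeps only the 1s before the max, and idxmax returns the first hit scanning backwards, i.e. the last one.
at_or_before_max = test.iloc[::-1].cummax().eq(test.max())  # True up to and including each column's max
ones = test.eq(1).iloc[::-1]                                # True at the 1s
print((at_or_before_max & ones).idxmax())                   # col1 6, col2 2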
Upvotes: 3
Reputation: 150825
Here's a mixed numpy and pandas solution:
(test.eq(1) & (test.index.values[:, None] < test.idxmax().values)).cumsum().idxmax()
which is a bit faster than most of the other solutions.
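The broadcast in the middle builds a rows-by-columns mask of positions strictly before each column's argmax (a sketch of that intermediate):
before_max = test.index.values[:, None] < test.idxmax().values
print(before_max[:, 0])  # True for rows 0..6 of col1, whose argmax is row 7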
Upvotes: 2
Reputation: 153550
Using @cs95's idea of last_valid_index:
test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
Output:
col1 6
col2 2
dtype: int64
Explained: use index slicing to cut each column from its start to its max value, then look for the values equal to 1 and take the index of the last True one.
Or as @QuangHoang suggests:
test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
Upvotes: 4
Reputation: 403218
Use Series.mask to hide elements that aren't 1 (along with everything from each column's max onward), then apply Series.last_valid_index to each column.
m = test.eq(test.max()).cumsum().gt(0) | test.ne(1)
test.mask(m).apply(pd.Series.last_valid_index)
col1 6
col2 2
dtype: int64
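To see what survives the mask (a sketch): only the 1s that come before each column's max stay non-NaN, so last_valid_index picks exactly the row asked for.
print(test.mask(m))  # 1.0 at rows 2 and 6 of col1 and row 2 of col2, NaN elsewhere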
Using numpy to vectorize, you can use cumsum and argmax:
idx = ((test.eq(1) & test.eq(test.max()).cumsum().eq(0))
.values
.cumsum(axis=0)
.argmax(axis=0))
idx
# array([6, 2])
pd.Series(idx, index=[*test])
col1 6
col2 2
dtype: int64
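The cumsum/argmax pairing is a standard trick for "position of the last True": the running count stops growing after the final True, and argmax returns the first position where that maximum is reached. A tiny sketch:
import numpy as np

flags = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=bool)  # col1's qualifying 1s
print(flags.cumsum().argmax())  # 6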
Upvotes: 5