Reputation: 13
I have a pandas dataframe (import pandas as pd):
print(df)
C1 C2
0 0 0
1 0 0
2 1 0
3 1 1
4 1 1
5 0 1
6 0 0
7 0 0
8 0 0
9 1 0
10 1 1
11 1 1
12 0 1
13 0 0
I want to count +1 in 'C3' after each rising edge (a rising edge starts when C1 = 1 and C2 = 0). I tried with iterrows():
count = 1
df['C3'] = 0
for index, row in df.iterrows():
    if (row.C1 == 1) and (row.C2 == 0):
        count += 1
        df.at[index, 'C3'] = count
    else:
        df.at[index, 'C3'] = count
print(df)
C1 C2 C3
0 0 0 1
1 0 0 1
2 1 0 2
3 1 1 2
4 1 1 2
5 0 1 2
6 0 0 2
7 0 0 2
8 0 0 2
9 1 0 3
10 1 1 3
11 1 1 3
12 0 1 3
13 0 0 3
For a dataframe with 300,000 rows this is a little bit slow. Is there a simple way to make it faster?
Thanks a lot for your help!
Upvotes: 0
Views: 401
Reputation: 11171
You can:
- create a series counts that is a boolean mask for the condition you want (counts = (df.C1 == 1) & (df.C2 == 0));
- assign a new column C3 to the original df with the value 1 + counts.cumsum().
Note: pandas joins the series to the dataframe based on index values, not order. Doing some intermediate operation that mangles the index of the df or of counts will produce unexpected results (a short demonstration follows the result below).
code:
counts = (df.C1 == 1) & (df.C2 == 0)
df["C3"] = 1 + counts.cumsum()
result:
C1 C2 C3
0 0 0 1
1 0 0 1
2 1 0 2
3 1 1 2
4 1 1 2
5 0 1 2
6 0 0 2
7 0 0 2
8 0 0 2
9 1 0 3
10 1 1 3
11 1 1 3
12 0 1 3
13 0 0 3
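To see the note in action, here is a minimal sketch of the pitfall (df_demo and the deliberate sort_index shuffle are hypothetical, just to force a mangled index):
import pandas as pd

df_demo = pd.DataFrame({"C1": [1, 0], "C2": [0, 0]})
mask = (df_demo.C1 == 1) & (df_demo.C2 == 0)   # [True, False] at index [0, 1]

# Mangle the index order before cumsum: the running count is now computed
# bottom-up, and the assignment realigns values by label, so the error is silent.
shuffled = mask.sort_index(ascending=False)     # [False, True] at index [1, 0]
df_demo["C3"] = 1 + shuffled.cumsum()
print(df_demo)
#    C1  C2  C3
# 0   1   0   2
# 1   0   0   1   <- the correct value here is 2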
Let's compare the performance of three options: iterrows, df.apply, and the vectorized solution above:
import numpy as np

df = pd.DataFrame(dict(C1=np.random.choice(2, size=100000),
                       C2=np.random.choice(2, size=100000)))
df1 = df.copy(deep=True)
df2 = df.copy(deep=True)
df3 = df.copy(deep=True)
def use_iterrows():
    count = 1
    df1['C3'] = 0
    for index, row in df1.iterrows():
        if (row.C1 == 1) and (row.C2 == 0):
            count += 1
            df1.at[index, 'C3'] = count
        else:
            df1.at[index, 'C3'] = count
def use_apply():
    df2['C3'] = df2.apply(lambda x: x['C1'] == 1 and x['C2'] == 0, axis=1).cumsum() + 1
def use_vectorized():
    counts = (df3.C1 == 1) & (df3.C2 == 0)
    df3["C3"] = 1 + counts.cumsum()
%timeit use_iterrows()
# 8.23 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit use_apply()
# 1.54 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit use_vectorized()
# 1.28 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
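%timeit is an IPython magic; if you are timing from a plain script instead, a rough equivalent with the standard-library timeit module (assuming the three functions above are already defined) would be:
import timeit

# One timing per function; repetition counts chosen to keep total runtime
# reasonable. Absolute numbers will vary with your machine and pandas version.
print(timeit.timeit(use_vectorized, number=1000))
print(timeit.timeit(use_apply, number=10))
print(timeit.timeit(use_iterrows, number=1))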
Summary: using vectorized functions is by far the fastest (~1000x(!) faster for a df of 100k rows). I like to stay in the habit of using vectorized solutions when possible. The strength of df.apply is that it is very flexible and will work in situations that are difficult for a vectorized operation. I don't think I've ever needed iterrows.
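One caveat that goes beyond the sample data (an assumption about inputs the question doesn't show): here the condition C1 == 1 and C2 == 0 happens to hold for exactly one row per edge, so summing the mask counts each edge once. If your real data could stay in that state for several consecutive rows, you would want to count only the transition into it; a hedged sketch using shift(), same vectorized idea:
# Count an edge only on the first row where the condition becomes True,
# not on every row where it holds.
edge = (df.C1 == 1) & (df.C2 == 0)
rising = edge & ~edge.shift(fill_value=False)   # True only at False -> True transitions
df["C3"] = 1 + rising.cumsum()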
Upvotes: 2
Reputation: 10883
Short answer:
df['C3'] = df.apply(lambda x: x['C1']==1 and x['C2']==0, axis=1).cumsum()+1
desired result:
C1 C2 C3
0 0 0 1
1 0 0 1
2 1 0 2
3 1 1 2
4 1 1 2
5 0 1 2
6 0 0 2
7 0 0 2
8 0 0 2
9 1 0 3
10 1 1 3
11 1 1 3
12 0 1 3
13 0 0 3
What you need to remember:
When it comes to long data, do not use iterrows -- it makes things significantly slower.
apply is a much better alternative to iterrows and much more efficient.
Upvotes: 0