I have a series that contains NaN and True values. I want to generate another series such that whenever a NaN occurs the value is 0, and between two NaN rows the True rows get a cumulative count (cumcount).
i.e.,
Input:
colA
NaN
True
True
True
True
NaN
True
NaN
NaN
True
True
True
True
True
Output
colA Sequence
NaN 0
True 0
True 1
True 2
True 3
NaN 0
True 0
NaN 0
NaN 0
True 0
True 1
True 2
True 3
True 4
How to perform this in pandas?
Upvotes: 9
Views: 1241
Late to the party, but here's a numpy solution wrapped in a function:
import pandas as pd, numpy as np

df = pd.DataFrame({'ColA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

def return_cumsum(df):
    v = np.array(df.ColA, dtype=float)
    n = np.isnan(v)
    v[n] = -np.diff(np.concatenate(([0.], np.cumsum(~n)[n])))
    df['Sequence'] = np.array(np.maximum(0, np.cumsum(v) - 1), dtype=int)
    return df

df = return_cumsum(df)
# ColA Sequence
# 0 NaN 0
# 1 True 0
# 2 True 1
# 3 True 2
# 4 True 3
# 5 NaN 0
# 6 True 0
# 7 NaN 0
# 8 NaN 0
# 9 True 0
# 10 True 1
# 11 True 2
# 12 True 3
# 13 True 4
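The trick is that each NaN slot is overwritten with a negative offset that exactly cancels the count accumulated since the previous NaN, so the cumulative sum restarts at zero after every gap. A minimal trace of the intermediates (variable names mirror the function above):

```python
import numpy as np

v = np.array([np.nan, 1, 1, 1, 1, np.nan, 1, np.nan, np.nan, 1, 1, 1, 1, 1])
n = np.isnan(v)                     # positions of the NaN separators
true_counts = np.cumsum(~n)[n]      # running count of Trues seen at each NaN: [0, 4, 5, 5]
offsets = -np.diff(np.concatenate(([0.], true_counts)))  # [0, -4, -1, 0]

v[n] = offsets                      # each NaN cancels the block of Trues before it
seq = np.maximum(0, np.cumsum(v) - 1).astype(int)
print(seq)                          # [0 0 1 2 3 0 0 0 0 0 1 2 3 4]
```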
Upvotes: 2
Try this:
df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
Full example:
>>> df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,True,np.nan,np.nan,True,True,True,True,True]})
>>> df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
>>> df
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
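Why this works: the grouping key starts a new group wherever a value differs from the previous row, and because `NaN != NaN` evaluates to True, every NaN row also opens its own group, so `cumcount` restarts at 0 both after a NaN and on each NaN itself. A sketch showing the key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

# A new group begins wherever the value differs from the previous row.
# NaN != NaN is True, so every NaN row opens its own group as well.
key = (df['colA'] != df['colA'].shift(1)).cumsum()
print(key.tolist())  # [1, 2, 2, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7]

seq = df.groupby(key).cumcount()
print(seq.tolist())  # [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 4]
```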
Upvotes: 3
If performance is important, it's better to avoid groupby when counting consecutive Trues:
a = df['colA'].notnull()
b = a.cumsum()
df['Sequence'] = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
print (df)
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
Explanation:
df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,
True,np.nan,np.nan,True,True,True,True,True]})
a = df['colA'].notnull()
#cumulative sum, Trues are processed as 1
b = a.cumsum()
#replace values that are True in a with NaN
c = b.mask(a)
#add 1 for count from 0
d = b.mask(a).add(1)
#forward fill NaNs, replace possible leading NaNs with 0 and cast to int
e = b.mask(a).add(1).ffill().fillna(0).astype(int)
#subtract from b to get the counts
f = b-b.mask(a).add(1).ffill().fillna(0).astype(int)
#replace the -1s with 0 where a is False, via where
g = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
#all together
df = pd.concat([a,b,c,d,e,f,g], axis=1, keys=list('abcdefg'))
print (df)
a b c d e f g
0 False 0 0.0 1.0 1 -1 0
1 True 1 NaN NaN 1 0 0
2 True 2 NaN NaN 1 1 1
3 True 3 NaN NaN 1 2 2
4 True 4 NaN NaN 1 3 3
5 False 4 4.0 5.0 5 -1 0
6 True 5 NaN NaN 5 0 0
7 False 5 5.0 6.0 6 -1 0
8 False 5 5.0 6.0 6 -1 0
9 True 6 NaN NaN 6 0 0
10 True 7 NaN NaN 6 1 1
11 True 8 NaN NaN 6 2 2
12 True 9 NaN NaN 6 3 3
13 True 10 NaN NaN 6 4 4
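As a sanity check (not part of the original answer), the vectorized result can be compared against a plain groupby reference that restarts the count after every NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

# Vectorized version from the answer above
a = df['colA'].notnull()
b = a.cumsum()
fast = (b - b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)

# Reference: group on the NaN markers, cumcount, then floor at 0
ref = df.groupby(df['colA'].isnull().cumsum()).cumcount().sub(1).clip(lower=0)

print(fast.tolist())  # [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 4]
assert fast.tolist() == ref.tolist()
```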
Upvotes: 11
You could use groupby + cumcount + mask here:
m = df.colA.isnull()
df['Sequence'] = df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
Or, use clip_lower in the last step and you don't have to pre-cache m:
df['Sequence'] = df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)
df
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
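Note that `Series.clip_lower` was deprecated and then removed in pandas 1.0; on current pandas the same idea (a sketch, not from the original answer) uses `clip(lower=...)` instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

# clip(lower=0) replaces the removed clip_lower(0) on pandas >= 1.0
df['Sequence'] = (df.groupby(df.colA.isnull().cumsum())
                    .cumcount().sub(1).clip(lower=0))
print(df['Sequence'].tolist())  # [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 4]
```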
Timings
df = pd.concat([df] * 10000, ignore_index=True)
# Timing the alternatives in this answer
%%timeit
m = df.colA.isnull()
df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
23.3 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)
24.1 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @user2314737's solution
%%timeit
df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
29.8 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @jezrael's solution
%%timeit
a = df['colA'].isnull()
b = a.cumsum()
(b-b.where(~a).add(1).ffill().fillna(0).astype(int)).clip_lower(0)
11.5 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note, your mileage may vary, depending on the data.
Upvotes: 8