Reputation: 149
I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 1 1 1 1 0 0
As you can see, 1s and 0s occur in different columns. I would like to count the number of 1s that occur before the first occurrence of a 1, 0, 0 sequence. For example, for person A the first double-zero event occurs at appear_2 and appear_3, so the duration is 1. Similarly, for person B the first double-zero event occurs at appear_3 and appear_4, so a total of two 1s occur before it. Note that the 1 at the start of the 1, 0, 0 sequence is included in the count: the 1 indicates that a person started the process, and the following 0, 0 indicates his/her absence for two consecutive appearances after initiating it. The resulting table should have a new column 'duration', like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 3
D 1 1 1 1 0 0 4
Thank you in advance.
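In plain Python, the rule I have in mind could be written per row like this (a naive sketch with a hypothetical helper duration; I am looking for a pandas solution):

```python
# Naive per-row transcription of the counting rule above
# (hypothetical helper; a pandas solution is preferred).
def duration(appearances):
    for i in range(len(appearances) - 1):
        if appearances[i] == 0 and appearances[i + 1] == 0:
            return sum(appearances[:i])  # 1s before the first double zero
    return sum(appearances)  # no double zero: count every 1

print(duration([1, 0, 0, 1, 0, 1]))  # A -> 1
print(duration([1, 1, 0, 0, 1, 0]))  # B -> 2
print(duration([1, 0, 1, 1, 0, 0]))  # C -> 3
```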
Upvotes: 2
Views: 77
Reputation: 25259
My logic checks each position against the next one. If both are 0, the mask turns True at that location. After that, doing cumsum on axis=1, the locations in front of the first True stay 0. Finally, comparing the cumsum to 0 keeps only the positions that appear before the double zero, and summing them gives the duration. To use this logic, I need to handle the case where the double zero comes first in the row, as in 'D', 0, 0, 1, 1, 0, 0. Your sample doesn't have this case; however, I expect the real data would.
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]

# True from the first 1 onward in each row (the process has started)
m = df1[df1.eq(1)].ffill(axis=1).notna()
# zeros that occur after the process started
df2 = df1[m].bfill(axis=1).eq(0)
# True where a double zero begins (current and next position both 0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# keep only positions before the first double zero and sum the 1s
df['duration'] = df1[m2.cumsum(axis=1) == 0].sum(axis=1)
Out[100]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 1 1 1 1 0 0 4.0
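To see what the masks do, here is the chain traced on a single row, B = [1, 1, 0, 0, 1, 0] (a standalone illustration, not part of the answer's code):

```python
import pandas as pd

# Trace the masks on row B = [1, 1, 0, 0, 1, 0].
df1 = pd.DataFrame([[1, 1, 0, 0, 1, 0]])
m = df1[df1.eq(1)].ffill(axis=1).notna()           # T T T T T T (started at appear_1)
df2 = df1[m].bfill(axis=1).eq(0)                   # F F T T F T (zeros after start)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)  # F F T F F T (double zero starts)
print(df1[m2.cumsum(axis=1) == 0].sum(axis=1))     # 0    2.0
```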
Update: I changed the sample to include the special case where the row starts with a double 0 (case D) and added case E where all appear_x are 1.
Sample (df_n):
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
0 A 1 0 0 1 0 1
1 B 1 1 0 0 1 0
2 C 1 0 1 1 0 0
3 D 0 0 1 1 0 0
4 E 1 1 1 1 1 1
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df_n[cols]
m = df1[df1.eq(1)].ffill(axis=1).notna()
df2 = df1[m].bfill(axis=1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df_n['duration'] = df1[m2.cumsum(axis=1) == 0].sum(axis=1)
Out[503]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 0 0 1 1 0 0 2.0
4 E 1 1 1 1 1 1 6.0
Upvotes: 2
Reputation: 323316
A little logic here: first we use a rolling sum to find where an adjacent pair sums to 0, then we do cumprod along the row; once it hits that 0, the product stays 0 from there on. Then we just sum, for each row, all the values where the cumulative product is not 0 to get the result.
s = df.iloc[:, 1:]
# rolling sum of each adjacent pair: 0 marks a double zero;
# cumprod then stays 0 from the first double zero onward
s1 = s.rolling(2, axis=1, min_periods=1).sum().cumprod(axis=1)
# keep only positions before the first double zero and sum per row
s.mask(s1 == 0).sum(axis=1)
Out[37]:
0 1.0
1 2.0
2 3.0
3 4.0
dtype: float64
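The intermediates can be seen on row A alone (a standalone sketch; it transposes instead of using rolling with axis=1, which newer pandas versions deprecate):

```python
import pandas as pd

# Trace the rolling-sum approach on row A = [1, 0, 0, 1, 0, 1].
# Transposing avoids rolling(..., axis=1), which recent pandas deprecates.
row = pd.DataFrame([[1, 0, 0, 1, 0, 1]])
pair_sums = row.T.rolling(2, min_periods=1).sum().T  # 1 1 0 1 1 1
running = pair_sums.cumprod(axis=1)                  # 1 1 0 0 0 0
print(row.mask(running == 0).sum(axis=1))            # 0    1.0
```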
Upvotes: 3