Md Mahmudul Hasan
Md Mahmudul Hasan

Reputation: 149

How to identify a sequence and index number before a particular sequence occurs for the first time

I have a dataframe in pandas, an example of which is provided below:

Person  appear_1   appear_2   appear_3   appear_4   appear_5  appear_6   
  A        1          0          0           1         0          1
  B        1          1          0           0         1          0 
  C        1          0          1           1         0          0
  D        1          1          0           1         0          0

As you can see 1 and 0 occurs randomly in different columns. It would be helpful, if anyone can suggest me a code in python such that I am able to count the number of times '1' occurs before the first occurrence of a 1, 0 and 0 in order. For example, for member A, the first double zero event occurs at appear_2 and appear_3, so the duration will be 1. Similarly for the member B, the first double zero event occurs at appear_3 and appear_4 so there are a total of two 1s that occur before this. So, the 1 included in 1,0,0 sequence is also considered during the count of total number of 1. it is because the 1 indicates that a person started the process, and 0,0 indicates his/her absence for two consecutive appearances after initiating the process. The resulting table should have a new column 'duration' something like this:

Person  appear_1   appear_2   appear_3   appear_4   appear_5  appear_6  duration   
  A        1          0          0           1         0          1         1
  B        1          1          0           0         1          0         2
  C        1          0          1           1         0          0         3
  D        1          1          1           1         0          0         4

Thank you in advance.

Upvotes: 2

Views: 77

Answers (2)

Andy L.
Andy L.

Reputation: 25259

My logic is checking the current position to next position. If they are both 0, the mask turns to True at that location. After that doing cumsum on axis=1. Locations are in front the first True will turn to 0 by cumsum. Finally, comparing mask to 0 to keep only positions appear before the double 0 and sum. To use this logic, I need to handle the case where double 0 are the first elements in row as in 'D', 0, 0, 1, 1, 0, 0. Your sample doesn't have this case. However, I expect the real data would have it.

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
m = df1[df1.eq(1)].ffill(1).notna()
df2 = df1[m].bfill(1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)

df['duration'] = df1[m2.cumsum(1) == 0].sum(1)

Out[100]:
  Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6  duration
0      A         1         0         0         1         0         1       1.0
1      B         1         1         0         0         1         0       2.0
2      C         1         0         1         1         0         0       3.0
3      D         1         1         1         1         0         0       4.0

Change your sample to have the special case where the first elements are 0

Update: add case E where all appear_x are 1.

Sample (df_n):

  Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6
0      A         1         0         0         1         0         1
1      B         1         1         0         0         1         0
2      C         1         0         1         1         0         0
3      D         0         0         1         1         0         0
4      E         1         1         1         1         1         1

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df_n[cols]
m = df1[df1.eq(1)].ffill(1).notna()
df2 = df1[m].bfill(1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)

df_n['duration'] = df1[m2.cumsum(1) == 0].sum(1)

Out[503]:
  Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6  duration
0      A         1         0         0         1         0         1       1.0
1      B         1         1         0         0         1         0       2.0
2      C         1         0         1         1         0         0       3.0
3      D         0         0         1         1         0         0       2.0
4      E         1         1         1         1         1         1       6.0

Upvotes: 2

BENY
BENY

Reputation: 323316

A little logic here , first we use rolling sum find the value equal to 0 , then we just need to do cumprod, once it hit the 0, the prod will return 0, then we just need to sum all value not 0 for each row get the result

s=df.iloc[:,1:]
s1=s.rolling(2,axis=1,min_periods=1).sum().cumprod(axis=1)
s.mask(s1==0).sum(1)
Out[37]: 
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

Upvotes: 3

Related Questions