greycat90
greycat90

Reputation: 55

Pandas iterate over rows and conditional count

I am trying to iterate over rows in a pandas Dataframe with a conditional count in a new column called Stage. For each name the stage should start at 1, and if the name is the same between rows then after a "Healthy" status a new stage should start. A "Healthy" event will be in the same stage as the preceding "Sick" events, if they exist. I've done the code in excel before but not sure how to do it in python.

What I have now is:

Date Name Status
2020-01-02 Mary Healthy
2020-01-05 Mary Sick
2020-01-15 Mary Sick
2020-01-20 Mary Healthy
2020-02-03 Mary Healthy
2020-02-06 Mary Sick
2020-02-10 Mary Sick
2020-02-15 Mary Healthy
2020-01-02 Bob Healthy
2020-01-05 Bob Healthy
2020-01-15 Bob Healthy
2020-01-20 Bob Sick
2020-02-03 Bob Sick
2020-02-06 Bob Sick
2020-02-10 Bob Sick
2020-02-15 Bob Healthy

What I would like to have:

Date Name Status Stage
2020-01-02 Mary Healthy 1
2020-01-05 Mary Sick 2
2020-01-15 Mary Sick 2
2020-01-20 Mary Healthy 2
2020-02-03 Mary Healthy 3
2020-02-06 Mary Sick 4
2020-02-10 Mary Sick 4
2020-02-15 Mary Healthy 4
2020-01-02 Bob Healthy 1
2020-01-05 Bob Healthy 2
2020-01-15 Bob Healthy 3
2020-01-20 Bob Sick 4
2020-02-03 Bob Sick 4
2020-02-06 Bob Sick 4
2020-02-10 Bob Sick 4
2020-02-15 Bob Healthy 4

Upvotes: 2

Views: 531

Answers (1)

Paul H
Paul H

Reputation: 68186

You don't need an explicit loop. You need the following:

  • group by the name column
  • apply to each group:
    • shift the Status column to look at the previous value
    • take cumulative sum of the following series:
      • if the previous value is null and current value is Healthy, we're at the first row so call it one
      • if the previous row is Healthy, call it one
      • otherwise, call it zero

from io import StringIO

import numpy
import pandas

df = pandas.read_csv(StringIO("""\
|Date|Name|Stage|
|2020-01-02|Mary|Healthy|
|2020-01-05|Mary|Sick|
|2020-01-15|Mary|Sick|
|2020-01-20|Mary|Healthy|
|2020-02-03|Mary|Healthy|
|2020-02-06|Mary|Sick|
|2020-02-10|Mary|Sick |
|2020-02-15|Mary|Healthy|
|2020-01-02|Bob|Healthy|
|2020-01-05|Bob|Healthy|
|2020-01-15|Bob|Healthy|
|2020-01-20|Bob|Sick|
|2020-02-03|Bob|Sick|
|2020-02-06|Bob|Sick|
|2020-02-10|Bob|Sick |
|2020-02-15|Bob|Healthy|
"""), sep='|').loc[:, ['Date', 'Name', 'Stage']]

output = (
    df.assign(Status=lambda df: df.groupby('Name')['Stage'].apply(lambda g: 
        numpy.bitwise_or(  #  returns 1 if either two conditions are met
            g.shift().eq('Healthy'),  # general case
            g.shift().isnull() & g.eq("Healthy") #  handles first row of a group
        ).cumsum()
    ))
)

print(output.to_string())

And I get:

          Date  Name    Stage  Status
0   2020-01-02  Mary  Healthy       1
1   2020-01-05  Mary     Sick       2
2   2020-01-15  Mary     Sick       2
3   2020-01-20  Mary  Healthy       2
4   2020-02-03  Mary  Healthy       3
5   2020-02-06  Mary     Sick       4
6   2020-02-10  Mary    Sick        4
7   2020-02-15  Mary  Healthy       4
8   2020-01-02   Bob  Healthy       1
9   2020-01-05   Bob  Healthy       2
10  2020-01-15   Bob  Healthy       3
11  2020-01-20   Bob     Sick       4
12  2020-02-03   Bob     Sick       4
13  2020-02-06   Bob     Sick       4
14  2020-02-10   Bob    Sick        4
15  2020-02-15   Bob  Healthy       4

Upvotes: 2

Related Questions