Reputation: 55
I am trying to iterate over rows in a pandas Dataframe with a conditional count in a new column called Stage. For each name the stage should start at 1, and if the name is the same between rows then after a "Healthy" status a new stage should start. A "Healthy" event will be in the same stage as the preceding "Sick" events, if they exist. I've done the code in excel before but not sure how to do it in python.
What I have now is:
Date | Name | Status |
---|---|---|
2020-01-02 | Mary | Healthy |
2020-01-05 | Mary | Sick |
2020-01-15 | Mary | Sick |
2020-01-20 | Mary | Healthy |
2020-02-03 | Mary | Healthy |
2020-02-06 | Mary | Sick |
2020-02-10 | Mary | Sick |
2020-02-15 | Mary | Healthy |
2020-01-02 | Bob | Healthy |
2020-01-05 | Bob | Healthy |
2020-01-15 | Bob | Healthy |
2020-01-20 | Bob | Sick |
2020-02-03 | Bob | Sick |
2020-02-06 | Bob | Sick |
2020-02-10 | Bob | Sick |
2020-02-15 | Bob | Healthy |
What I would like to have:
Date | Name | Status | Stage |
---|---|---|---|
2020-01-02 | Mary | Healthy | 1 |
2020-01-05 | Mary | Sick | 2 |
2020-01-15 | Mary | Sick | 2 |
2020-01-20 | Mary | Healthy | 2 |
2020-02-03 | Mary | Healthy | 3 |
2020-02-06 | Mary | Sick | 4 |
2020-02-10 | Mary | Sick | 4 |
2020-02-15 | Mary | Healthy | 4 |
2020-01-02 | Bob | Healthy | 1 |
2020-01-05 | Bob | Healthy | 2 |
2020-01-15 | Bob | Healthy | 3 |
2020-01-20 | Bob | Sick | 4 |
2020-02-03 | Bob | Sick | 4 |
2020-02-06 | Bob | Sick | 4 |
2020-02-10 | Bob | Sick | 4 |
2020-02-15 | Bob | Healthy | 4 |
Upvotes: 2
Views: 531
Reputation: 68186
You don't need an explicit loop. You need the following:
from io import StringIO
import numpy
import pandas
df = pandas.read_csv(StringIO("""\
|Date|Name|Stage|
|2020-01-02|Mary|Healthy|
|2020-01-05|Mary|Sick|
|2020-01-15|Mary|Sick|
|2020-01-20|Mary|Healthy|
|2020-02-03|Mary|Healthy|
|2020-02-06|Mary|Sick|
|2020-02-10|Mary|Sick |
|2020-02-15|Mary|Healthy|
|2020-01-02|Bob|Healthy|
|2020-01-05|Bob|Healthy|
|2020-01-15|Bob|Healthy|
|2020-01-20|Bob|Sick|
|2020-02-03|Bob|Sick|
|2020-02-06|Bob|Sick|
|2020-02-10|Bob|Sick |
|2020-02-15|Bob|Healthy|
"""), sep='|').loc[:, ['Date', 'Name', 'Stage']]
output = (
df.assign(Status=lambda df: df.groupby('Name')['Stage'].apply(lambda g:
numpy.bitwise_or( # returns 1 if either two conditions are met
g.shift().eq('Healthy'), # general case
g.shift().isnull() & g.eq("Healthy") # handles first row of a group
).cumsum()
))
)
print(output.to_string())
And I get:
Date Name Stage Status
0 2020-01-02 Mary Healthy 1
1 2020-01-05 Mary Sick 2
2 2020-01-15 Mary Sick 2
3 2020-01-20 Mary Healthy 2
4 2020-02-03 Mary Healthy 3
5 2020-02-06 Mary Sick 4
6 2020-02-10 Mary Sick 4
7 2020-02-15 Mary Healthy 4
8 2020-01-02 Bob Healthy 1
9 2020-01-05 Bob Healthy 2
10 2020-01-15 Bob Healthy 3
11 2020-01-20 Bob Sick 4
12 2020-02-03 Bob Sick 4
13 2020-02-06 Bob Sick 4
14 2020-02-10 Bob Sick 4
15 2020-02-15 Bob Healthy 4
Upvotes: 2