Reputation: 17
I would like to achieve the following with python pandas.
Suppose I have the following tables:
Events table
id event capture_date
1 Up '2021-01-10'
1 Down '2021-03-25'
...
1 Up '2021-08-01'
2 Up '2021-06-02'
2 Down '2021-09-15'
2 Up '2021-11-03'
2 Down '2021-11-05'
...
10 Down'2020-06-02'
10 Up '2020-09-15'
10 Down'2021-10-03'
10 Up '2021-11-09'
Activity table
id activity_date
1 '2021-01-12'
1 '2021-03-28'
1 '2021-08-04'
2 '2021-06-30'
2 '2021-11-01'
2 '2021-11-03'
10 '2020-01-03'
For each id group, I would like to add a third column to the "Activity" table" that depends on the "capture_date" from the "Events" table. For example, for id =1: if the the activity_date falls between Up and Down events, then write 0, otherwise 1. And this should be done for every id group.
Thanks!
Upvotes: 0
Views: 231
Reputation: 2271
This solution uses a package called staircase
which is built on pandas and numpy for step functions. The data you have describes step functions. The functions step "Up" and "Down" as you have labelled it. We'll assume this means stepping up to a value of 1 and down to a value of 0, and then switch these values in line with your spec.
setup
import pandas as pd
events = pd.DataFrame(
{
"id":[1,1,1,2,2,2],
"event":["Up", "Down", "Up", "Up", "Down", "Up"],
"capture_date":["2021-01-10", "2021-03-25", "2021-08-01", "2021-06-02", "2021-09-15", "2021-11-03"],
}
)
activities = pd.DataFrame(
{
"id":[1,1,1,2,2,2],
"activity_date":["2021-01-12", "2021-03-28", "2021-08-04", "2021-06-30", "2021-11-01", "2021-11-03"],
}
)
events["capture_date"] = pd.to_datetime(events["capture_date"])
activities["activity_date"] = pd.to_datetime(activities["activity_date"])
solution
A stepfunction is represented by the staircase.Stairs
class. It takes vectors of start and end times (up and down in your case). These times do not need to be paired, or equal in count. If there is no matching "down" after an "up" then this simply means the step function will continue on with a value of 1 indefinitely as it heads towards infinity. . Likewise, if there is no matching "up" then the step function will have a value of 1 indefinitely as it heads towards negative infinity. We'll group the events
dataframe by id
and use the data to construct a step function for each id
import staircase as sc
stepfunctions = events.groupby("id").apply(lambda df:
sc.Stairs(
start=df.query("event == 'Up'").capture_date,
end=df.query("event == 'Down'").capture_date,
)
)
stepfunctions
is a pandas.Series
, indexed by id
and the values are staircase.Stairs
objects
id
1 <staircase.Stairs, id=2209708361352>
2 <staircase.Stairs, id=2209708432264>
dtype: object
To handle the case where there are id
values in events
which do not appear in activities
we'll manually add zero-valued step functions for these ids
for i in pd.Index(activities["id"]).difference(stepfunctions.index):
stepfunctions[i] = sc.Stairs()
You can do many things with step functions, including visualising
stepfunctions[1].plot(style="hlines")
We can sample the step functions eg
stepfunctions[1].sample("2021-01-12", include_index=True)
which gives you
2021-01-12 1
dtype: int64
These step functions are currently 1 between up and down, and 0 between down and up. It seems like you want the opposite. This can be done by flipping these boolean valued step functions with staircase.Stairs.invert
Putting it all together, with the help of pandas.concat
we have
result = (
pd.concat(
[stepfunctions[id].invert().sample(df["activity_date"], include_index=True) for id, df in activities.groupby("id")]
) # a pandas.Series
.astype(int) # convert integer floats
.rename("Activity") # renames the Series to your required name
.reset_index() # turns activity date from index to column
.assign(id=activities["id"]) # inserts id column
)
result
will be a dataframe
activity_date Activity id
0 2021-01-12 0 1
1 2021-03-28 1 1
2 2021-08-04 0 1
3 2021-06-30 0 2
4 2021-11-01 1 2
5 2021-11-03 0 2
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Upvotes: 1