GDC
GDC

Reputation: 17

Check if timestamps column values are in data range for each group

I would like to achieve the following with python pandas.

Suppose I have the following tables:

Events table

id event capture_date 
1  Up   '2021-01-10'  
1  Down '2021-03-25'
       ...  
1  Up   '2021-08-01'  
2  Up   '2021-06-02'  
2  Down '2021-09-15'  
2  Up   '2021-11-03' 
2  Down '2021-11-05'
       ...
10  Down'2020-06-02'  
10  Up  '2020-09-15'  
10  Down'2021-10-03' 
10  Up  '2021-11-09'

Activity table

id activity_date 
1  '2021-01-12'  
1  '2021-03-28'  
1  '2021-08-04'  
2  '2021-06-30'  
2  '2021-11-01'  
2  '2021-11-03'
10 '2020-01-03'  

For each id group, I would like to add a third column to the "Activity" table" that depends on the "capture_date" from the "Events" table. For example, for id =1: if the the activity_date falls between Up and Down events, then write 0, otherwise 1. And this should be done for every id group.

Thanks!

Upvotes: 0

Views: 231

Answers (1)

Riley
Riley

Reputation: 2271

This solution uses a package called staircase which is built on pandas and numpy for step functions. The data you have describes step functions. The functions step "Up" and "Down" as you have labelled it. We'll assume this means stepping up to a value of 1 and down to a value of 0, and then switch these values in line with your spec.

setup

import pandas as pd

events = pd.DataFrame(
    {
        "id":[1,1,1,2,2,2],
        "event":["Up", "Down", "Up", "Up", "Down", "Up"],
        "capture_date":["2021-01-10", "2021-03-25", "2021-08-01", "2021-06-02", "2021-09-15", "2021-11-03"],
    }
)

activities = pd.DataFrame(
    {
        "id":[1,1,1,2,2,2],
        "activity_date":["2021-01-12", "2021-03-28", "2021-08-04", "2021-06-30", "2021-11-01", "2021-11-03"],
    }
)

events["capture_date"] = pd.to_datetime(events["capture_date"])
activities["activity_date"] = pd.to_datetime(activities["activity_date"])

solution

A stepfunction is represented by the staircase.Stairs class. It takes vectors of start and end times (up and down in your case). These times do not need to be paired, or equal in count. If there is no matching "down" after an "up" then this simply means the step function will continue on with a value of 1 indefinitely as it heads towards infinity. . Likewise, if there is no matching "up" then the step function will have a value of 1 indefinitely as it heads towards negative infinity. We'll group the events dataframe by id and use the data to construct a step function for each id

import staircase as sc

stepfunctions = events.groupby("id").apply(lambda df: 
    sc.Stairs(
        start=df.query("event == 'Up'").capture_date,
        end=df.query("event == 'Down'").capture_date,
    )
)

stepfunctions is a pandas.Series, indexed by id and the values are staircase.Stairs objects

id
1    <staircase.Stairs, id=2209708361352>
2    <staircase.Stairs, id=2209708432264>
dtype: object

To handle the case where there are id values in events which do not appear in activities we'll manually add zero-valued step functions for these ids

for i in pd.Index(activities["id"]).difference(stepfunctions.index):
    stepfunctions[i] = sc.Stairs()

You can do many things with step functions, including visualising

stepfunctions[1].plot(style="hlines")

enter image description here

We can sample the step functions eg

stepfunctions[1].sample("2021-01-12", include_index=True)

which gives you

2021-01-12    1
dtype: int64

These step functions are currently 1 between up and down, and 0 between down and up. It seems like you want the opposite. This can be done by flipping these boolean valued step functions with staircase.Stairs.invert

Putting it all together, with the help of pandas.concat we have

result = (
    pd.concat(
        [stepfunctions[id].invert().sample(df["activity_date"], include_index=True) for id, df in activities.groupby("id")]
    )  # a pandas.Series
    .astype(int)  # convert integer floats
    .rename("Activity")  # renames the Series to your required name
    .reset_index() # turns activity date from index to column
    .assign(id=activities["id"])  # inserts id column
)

result will be a dataframe

  activity_date  Activity  id
0    2021-01-12         0   1
1    2021-03-28         1   1
2    2021-08-04         0   1
3    2021-06-30         0   2
4    2021-11-01         1   2
5    2021-11-03         0   2

note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.

Upvotes: 1

Related Questions