Reputation: 27
Let's say I have the following DataFrame in Pandas
date | customer | attended |
---|---|---|
2022-01-01 | John | True |
2022-01-02 | John | True |
2022-01-04 | John | True |
2022-01-05 | Mark | True |
what transformations could I do to fill in the missing gaps within the dates on a given frequency (daily in this case, but it can by any other) so it ends up like this:
date | customer | attended |
---|---|---|
2022-01-01 | John | True |
2022-01-01 | Mark | False |
2022-01-02 | John | True |
2022-01-02 | Mark | False |
2022-01-03 | John | False |
2022-01-03 | Mark | False |
2022-01-04 | John | True |
2022-01-04 | Mark | False |
2022-01-05 | John | False |
2022-01-05 | Mark | True |
This can be done on an individual level by filtering on just one customer and doing an outer join with another DataFrame that has all the dates, and it will fill the empty ones with NaNs, but I can't do that with all the different people at customer which is what I need to do. I'd appreciate an approach that's as computationally efficient as possible and doesn't involve a lot of convoluted iterations over the dataset.
Upvotes: 1
Views: 338
Reputation: 28644
One option is with the complete function from pyjanitor, to expose the explicitly missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# generate new dates
new_dates = {"date": pd.date_range(df.date.min(), df.date.max(), freq="D")}
df.complete(new_dates, 'customer', sort = True).fillna({'attended':False})
date customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
Upvotes: 0
Reputation:
Let's try it with pivot
+ date_range
+ reindex
+ stack
:
tmp = df.pivot('date','customer','attended')
tmp.index = pd.to_datetime(tmp.index)
out = tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1])).fillna(False).stack().reset_index().rename(columns={0:'attended'})
Output:
level_0 customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
Upvotes: 3
Reputation: 11321
Here's one way (df
your dataframe):
df.date = pd.to_datetime(df.date) # Just in case
customers = df.customer.unique()
df_fill = pd.DataFrame(
(
[day, customer]
for day in pd.date_range(df.date.min(), df.date.max(), freq="D")
for customer in customers
),
columns=["date", "customer"]
)
df = df_fill.merge(df, on=["date", "customer"], how="left")
df.attended = df.attended.fillna(False)
Output:
date customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
Upvotes: 1