Reputation: 1243
I have a dataframe that looks somewhat like this. There are 100,000 rows where each row has the name of a person and the start and end date of a trip they took.
The trips of the different travelers may overlap.
What I want to do is create a dataframe that stores their travel history as a time series where if they stayed home on a date, the value is zero and if they were travelling on a date the value is 1. So something like this:
Here is what I have done so far and it seems to work. I have a feeling that what I am doing may not be efficient and may not be best practice since I am populating a dataframe inside a loop. I'd appreciate any help in making my code better and me learning new stuff in the process. MWE below. Thanks.
First, I create the raw data.
import datetime
import pandas as pd
import numpy as np
my_dict = {'Name': ["A", "A", "A", "B", "B"], 'Start Date': ['1/1/2019', '4/5/2019', '7/7/2019', '1/1/2019', '12/1/2019'], 'End Date': ['1/6/2019', '4/20/2019', '7/9/2019', '1/17/2019', '12/10/2019']}
df_raw = pd.DataFrame.from_dict(my_dict)
df_raw['Start Date'] = pd.to_datetime(df_raw['Start Date'], infer_datetime_format=True)
df_raw['End Date'] = pd.to_datetime(df_raw['End Date'], infer_datetime_format=True)
Then I create a dataframe filled with zeros with appropriate index and column names
date_range2 = pd.date_range(start=df_raw['Start Date'].min(), end=df_raw['End Date'].max())
columns2 = df_raw['Name'].unique()
df_panel2 = pd.DataFrame(index=date_range2, columns=columns2)
df_panel2 = df_panel2.fillna(0)
Then I loop through my original dataframe and populate the new dataframe wherever needed
for index, row in df_raw.iterrows():
for date in pd.date_range(start=row['Start Date'], end=row['End Date']):
df_panel2.loc[date.strftime('%m/%d/%Y'),row['Name']] = 1
Upvotes: 1
Views: 320
Reputation: 75080
You can try with pd.IntervalIndex.
along with pd.get_dummies
after creating the date_range2
idx = pd.IntervalIndex.from_arrays(df_raw['Start Date'],
df_raw['End Date'],closed='both')
out = pd.get_dummies(df_raw.set_index(idx)['Name'].reindex(date_range2))
print(out.head(15))
A B
2019-01-01 1 0
2019-01-02 1 0
2019-01-03 1 0
2019-01-04 1 0
2019-01-05 1 0
2019-01-06 1 0
2019-01-07 0 0
2019-01-08 0 1
2019-01-09 0 1
2019-01-10 0 1
2019-01-11 0 0
2019-01-12 0 0
2019-01-13 0 0
2019-01-14 0 0
2019-01-15 0 0
Upvotes: 2