Amatya
Amatya

Reputation: 1243

Creating a Panel Dataframe

I have a dataframe that looks somewhat like this. There are 100,000 rows where each row has the name of a person and the start and end date of a trip they took.

The trips of the different travelers may overlap.

enter image description here

What I want to do is create a dataframe that stores their travel history as a time series where if they stayed home on a date, the value is zero and if they were travelling on a date the value is 1. So something like this:

enter image description here

Here is what I have done so far and it seems to work. I have a feeling that what I am doing may not be efficient and may not be best practice since I am populating a dataframe inside a loop. I'd appreciate any help in making my code better and me learning new stuff in the process. MWE below. Thanks.

First, I create the raw data.

import datetime
import pandas as pd
import numpy as np

my_dict = {'Name': ["A", "A", "A", "B", "B"], 'Start Date': ['1/1/2019', '4/5/2019', '7/7/2019', '1/1/2019', '12/1/2019'], 'End Date': ['1/6/2019', '4/20/2019', '7/9/2019', '1/17/2019', '12/10/2019']}
df_raw = pd.DataFrame.from_dict(my_dict)
df_raw['Start Date'] = pd.to_datetime(df_raw['Start Date'], infer_datetime_format=True) 
df_raw['End Date'] = pd.to_datetime(df_raw['End Date'], infer_datetime_format=True) 

Then I create a dataframe filled with zeros with appropriate index and column names

date_range2 = pd.date_range(start=df_raw['Start Date'].min(), end=df_raw['End Date'].max())
columns2 = df_raw['Name'].unique()

df_panel2 = pd.DataFrame(index=date_range2, columns=columns2)
df_panel2 = df_panel2.fillna(0)

Then I loop through my original dataframe and populate the new dataframe wherever needed

for index, row in df_raw.iterrows():
    for date in pd.date_range(start=row['Start Date'], end=row['End Date']):
        df_panel2.loc[date.strftime('%m/%d/%Y'),row['Name']] = 1

Upvotes: 1

Views: 320

Answers (1)

anky
anky

Reputation: 75080

You can try with pd.IntervalIndex. along with pd.get_dummies after creating the date_range2

idx = pd.IntervalIndex.from_arrays(df_raw['Start Date'],
                                   df_raw['End Date'],closed='both')

out = pd.get_dummies(df_raw.set_index(idx)['Name'].reindex(date_range2))

print(out.head(15))

            A  B
2019-01-01  1  0
2019-01-02  1  0
2019-01-03  1  0
2019-01-04  1  0
2019-01-05  1  0
2019-01-06  1  0
2019-01-07  0  0
2019-01-08  0  1
2019-01-09  0  1
2019-01-10  0  1
2019-01-11  0  0
2019-01-12  0  0
2019-01-13  0  0
2019-01-14  0  0
2019-01-15  0  0

Upvotes: 2

Related Questions