FranG91

Reputation: 83

Create DataFrame rows from date range unnesting field

I have the following Pandas DataFrame:

ID       start_date        end_date        codes         type
1        2019-01-01       2019-01-05      [x, y]          A
2        2019-01-01       2019-01-05      [x, y, z]       B
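
For reproducibility, the frame above could be built roughly like this. This is a minimal sketch that assumes codes holds Python lists and that the dates start out as strings; neither is stated explicitly in the table.

```python
import pandas as pd

# Sample frame mirroring the table above.
# Assumption: 'codes' holds Python lists and dates are plain strings.
df = pd.DataFrame({
    "ID": [1, 2],
    "start_date": ["2019-01-01", "2019-01-01"],
    "end_date": ["2019-01-05", "2019-01-05"],
    "codes": [["x", "y"], ["x", "y", "z"]],
    "type": ["A", "B"],
})

# Convert the date columns to real datetimes
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

print(df)
```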

For each code, I want to generate one row per day in the range between start_date and end_date (both inclusive). The expected output is:

ID          date              codes        type
1        2019-01-01            x            A
1        2019-01-02            x            A
1        2019-01-03            x            A
1        2019-01-04            x            A
1        2019-01-05            x            A
1        2019-01-01            y            A
1        2019-01-02            y            A
1        2019-01-03            y            A
1        2019-01-04            y            A
1        2019-01-05            y            A
2        2019-01-01            x            B
2        2019-01-02            x            B
.....

Thank you very much!

Upvotes: 1

Views: 33

Answers (1)

ansev

Reputation: 30920

Pandas >= 0.25.0 (DataFrame.explode was added in 0.25.0)

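import pandas as pd

# If the date columns are stored as strings, convert them first:
# df['start_date'] = pd.to_datetime(df['start_date'])
# df['end_date'] = pd.to_datetime(df['end_date'])
new_df = (df.melt(['ID','type','codes'], value_name='date')  # stack start/end dates into one 'date' column
            .set_index('date')
            .groupby(['ID','type'])
            .resample('D').ffill()                           # add a row for every day between start and end
            .drop(columns='variable')                        # drop the start_date/end_date label column
            .explode('codes')                                # one row per code
            .reset_index(level=[0,1], drop=True)             # keep only 'date' in the index
            .sort_values(['ID','type','codes'])
            .reset_index()                                   # turn 'date' back into a column
            .reindex(columns=['ID','date','codes','type'])
         )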
print(new_df)

Pandas < 0.25.0

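import numpy as np
import pandas as pd

# If the date columns are stored as strings, convert them first:
# df['start_date'] = pd.to_datetime(df['start_date'])
# df['end_date'] = pd.to_datetime(df['end_date'])
new_df = (df.melt(['ID','type','codes'], value_name='date')  # stack start/end dates into one 'date' column
            .set_index('date')
            .groupby(['ID','type'])
            .resample('D').ffill()                           # add a row for every day between start and end
            .drop(columns='variable'))                       # drop the start_date/end_date label column

# Emulate explode: repeat each row once per code, then flatten the lists
new_df = (new_df.reindex(new_df.index.repeat(new_df.codes.str.len()))
                .assign(codes=np.concatenate(new_df.codes.values))
                .reset_index(level=[0,1], drop=True)
                .sort_values(['ID','type','codes'])
                .reset_index()
                .reindex(columns=['ID','date','codes','type']))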

print(new_df)
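
The repeat/concatenate step used for older pandas can be hard to read inside the full chain, so here is a minimal sketch of that idiom on its own. The frame `small` and the names `lengths` and `unnested` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each row carries a list in 'codes'
small = pd.DataFrame({"ID": [1, 2],
                      "codes": [["x", "y"], ["x", "y", "z"]]})

# Number of elements in each list
lengths = small["codes"].str.len()

unnested = (
    small.reindex(small.index.repeat(lengths))                  # repeat each row once per code
         .assign(codes=np.concatenate(small["codes"].values))   # flatten the lists in the same order
         .reset_index(drop=True)
)
print(unnested)
```

Because the rows are repeated in their original order and the lists are concatenated in that same order, each repeated row lines up with one element of its own list.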

Output

    ID       date codes type
0    1 2019-01-01     x    A
1    1 2019-01-02     x    A
2    1 2019-01-03     x    A
3    1 2019-01-04     x    A
4    1 2019-01-05     x    A
5    1 2019-01-01     y    A
6    1 2019-01-02     y    A
7    1 2019-01-03     y    A
8    1 2019-01-04     y    A
9    1 2019-01-05     y    A
10   2 2019-01-01     x    B
11   2 2019-01-02     x    B
12   2 2019-01-03     x    B
13   2 2019-01-04     x    B
14   2 2019-01-05     x    B
15   2 2019-01-01     y    B
16   2 2019-01-02     y    B
17   2 2019-01-03     y    B
18   2 2019-01-04     y    B
19   2 2019-01-05     y    B
20   2 2019-01-01     z    B
21   2 2019-01-02     z    B
22   2 2019-01-03     z    B
23   2 2019-01-04     z    B
24   2 2019-01-05     z    B
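
As an alternative sketch (not part of the code above), on pandas >= 0.25.0 the same result might be reached without melt/resample by building each row's list of days with pd.date_range and exploding twice. The sample frame is rebuilt here so the snippet runs on its own; it assumes codes holds Python lists and that both end dates are inclusive.

```python
import pandas as pd

# Sample data mirroring the question (assumption: 'codes' holds Python lists)
df = pd.DataFrame({
    "ID": [1, 2],
    "start_date": ["2019-01-01", "2019-01-01"],
    "end_date": ["2019-01-05", "2019-01-05"],
    "codes": [["x", "y"], ["x", "y", "z"]],
    "type": ["A", "B"],
})

# One list of daily timestamps per row, both endpoints included
df["date"] = pd.Series(
    [list(pd.date_range(s, e, freq="D"))
     for s, e in zip(df["start_date"], df["end_date"])],
    index=df.index,
)

new_df = (
    df.drop(columns=["start_date", "end_date"])
      .explode("codes")            # one row per code
      .reset_index(drop=True)      # make the index unique before exploding again
      .explode("date")             # one row per code per day
      .reset_index(drop=True)
      [["ID", "date", "codes", "type"]]
)

# explode can leave an object column, so restore the datetime dtype
new_df["date"] = pd.to_datetime(new_df["date"])
print(new_df)
```

The index is reset between the two explode calls so the second one always operates on a unique index.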

Upvotes: 2
