Ajinkya Deshmukh
Ajinkya Deshmukh

Reputation: 53

Expand dataframe for each date | Pandas

I have a dataframe of user connections where UID represents a user, and date represents the date on which the user made connections (represented by #fans).

UID     Date        #fans
9305    1/25/2015   5
9305    2/26/2015   7
9305    3/27/2015   8
9305    4/1/2015    9
1305    6/6/2015    14
1305    6/26/2015   16
1305    6/27/2015   17

The date range of the dataframe is 01-01-2014 to 12-01-2020.

I need to expand the data such that for each user the date should contain each date in the date range and each date should have #fans as total connections of the user till that date. e.g. The desired output is

UID     Date        #fans
9305    1/1/2014    0
9305    1/2/2014    0
9305    1/3/2014    0
...
9305    1/25/2015   5
9305    1/26/2015   5
9305    1/27/2015   5
...
9305    2/26/2015   7
9305    3/27/2015   8
9305    3/28/2015   8
9305    3/29/2015   8
...
9305    4/1/2015    9
...
9305    12/1/2020   9

*for all the UIDs

I am unsure about what approach should I take here. Any help is appreciated.

Upvotes: 1

Views: 1241

Answers (1)

Joe Ferndz
Joe Ferndz

Reputation: 8508

The below code should give you the desired results.

  1. Step 1: Create a pd.Series of date range between 01-01-2014 and 12-01-2020.

    datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')

  2. Step 2: Get the length of the date range. In our case, it is 2527.

    nd = len(datelist)

  3. Step 3: Get the length of the unique values of UIDs in the original dataframe. In the example, we have 2

    nu = len(df['UID'].unique())

  4. Step 4: Create a DataFrame of two columns - UID and Date for the full range (2527 x 2)

    df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd, 'Date':np.repeat(datelist,nu)})

  5. Step 5: Now merge the original dataframe to df_final so you can get the specific value assigned to #fans.

    df_final = df_final.merge(df, how='left')

  6. Step 6: Group by UID and forward filling the rows. And if they are NA, set the value to 0

    df_final[['Date','#fans']] = df_final.groupby('UID')[['Date','#fans']].ffill().fillna(0)

  7. Step 7: Finally, we change the dtype of #fans to int otherwise values will be in float with xx.0

    df_final['#fans'] = df_final['#fans'].astype('int64')

Putting all this together, here's the code:

import pandas as pd
import numpy as np
from datetime import datetime

c = ['UID','Date','#fans']
d = [[9305,    '1/25/2015',   5],
[9305,    '2/26/2015',  7],
[9305,    '3/27/2015',   8],
[9305,    '4/1/2015',    9],
[1305,    '6/6/2015',   14],
[1305,    '6/26/2015',   16],
[1305,    '6/27/2015',   17]]
df = pd.DataFrame(d,columns=c)
df.Date = pd.to_datetime(df.Date)
print (df)

datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')
nd = len(datelist)
nu = len(df['UID'].unique())

df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd,
                         'Date':np.repeat(datelist,nu)})

df_final = df_final.merge(df, how='left')

df_final[['Date','#fans']] = df_final.groupby('UID')[['Date','#fans']].ffill().fillna(0)

df_final['#fans'] = df_final['#fans'].astype('int64')
print (df_final)

The output of this will be:

       UID       Date  #fans
1     1305 2014-01-01      0
3     1305 2014-01-02      0
5     1305 2014-01-03      0
7     1305 2014-01-04      0
9     1305 2014-01-05      0
...    ...        ...    ...
5044  9305 2020-11-27      9
5046  9305 2020-11-28      9
5048  9305 2020-11-29      9
5050  9305 2020-11-30      9
5052  9305 2020-12-01      9

The above code also takes into account the shift from one UID to the other.

The code will ensure the following:

       UID       Date  #fans
2526  1305 2020-12-01     17
       UID       Date  #fans
2527  9305 2014-01-01      0

Upvotes: 3

Related Questions