Reputation: 53
I have a dataframe of user connections where UID represents a user, and date represents the date on which the user made connections (represented by #fans).
UID Date #fans
9305 1/25/2015 5
9305 2/26/2015 7
9305 3/27/2015 8
9305 4/1/2015 9
1305 6/6/2015 14
1305 6/26/2015 16
1305 6/27/2015 17
The date range of the dataframe is 01-01-2014 to 12-01-2020.
I need to expand the data such that for each user the date should contain each date in the date range and each date should have #fans as total connections of the user till that date. e.g. The desired output is
UID Date #fans
9305 1/1/2014 0
9305 1/2/2014 0
9305 1/3/2014 0
...
9305 1/25/2015 5
9305 1/26/2015 5
9305 1/27/2015 5
...
9305 2/26/2015 7
9305 3/27/2015 8
9305 3/28/2015 8
9305 3/29/2015 8
...
9305 4/1/2015 9
...
9305 12/1/2020 9
*for all the UIDs
I am unsure about what approach should I take here. Any help is appreciated.
Upvotes: 1
Views: 1241
Reputation: 8508
The below code should give you the desired results.
Step 1: Create a pd.Series of date range between 01-01-2014 and 12-01-2020.
datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')
Step 2: Get the length of the date range. In our case, it is 2527.
nd = len(datelist)
Step 3: Get the length of the unique values of UIDs in the original dataframe. In the example, we have 2
nu = len(df['UID'].unique())
Step 4: Create a DataFrame of two columns - UID and Date for the full range (2527 x 2)
df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd, 'Date':np.repeat(datelist,nu)})
Step 5: Now merge the original dataframe to df_final
so you can
get the specific value assigned to #fans.
df_final = df_final.merge(df, how='left')
Step 6: Group by UID and forward filling the rows. And if they are NA, set the value to 0
df_final[['Date','#fans']] = df_final.groupby('UID')[['Date','#fans']].ffill().fillna(0)
Step 7: Finally, we change the dtype of #fans to int otherwise values will be in float with xx.0
df_final['#fans'] = df_final['#fans'].astype('int64')
Putting all this together, here's the code:
import pandas as pd
import numpy as np
from datetime import datetime
c = ['UID','Date','#fans']
d = [[9305, '1/25/2015', 5],
[9305, '2/26/2015', 7],
[9305, '3/27/2015', 8],
[9305, '4/1/2015', 9],
[1305, '6/6/2015', 14],
[1305, '6/26/2015', 16],
[1305, '6/27/2015', 17]]
df = pd.DataFrame(d,columns=c)
df.Date = pd.to_datetime(df.Date)
print (df)
datelist = pd.date_range(start='01-01-2014', end='12-01-2020', freq='1d')
nd = len(datelist)
nu = len(df['UID'].unique())
df_final = pd.DataFrame({'UID':df['UID'].unique().tolist()*nd,
'Date':np.repeat(datelist,nu)})
df_final = df_final.merge(df, how='left')
df_final[['Date','#fans']] = df_final.groupby('UID')[['Date','#fans']].ffill().fillna(0)
df_final['#fans'] = df_final['#fans'].astype('int64')
print (df_final)
The output of this will be:
UID Date #fans
1 1305 2014-01-01 0
3 1305 2014-01-02 0
5 1305 2014-01-03 0
7 1305 2014-01-04 0
9 1305 2014-01-05 0
... ... ... ...
5044 9305 2020-11-27 9
5046 9305 2020-11-28 9
5048 9305 2020-11-29 9
5050 9305 2020-11-30 9
5052 9305 2020-12-01 9
The above code also takes into account the shift from one UID to the other.
The code will ensure the following:
UID Date #fans
2526 1305 2020-12-01 17
UID Date #fans
2527 9305 2014-01-01 0
Upvotes: 3