Ben
Ben

Reputation: 314

Pandas: How to fill missing dates in a long dataframe with multiple non overlapping time series?

I have a long dataframe with multiple timeseries which are non overlapping.

import numpy as np
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,2],
                   't':[0,1,2,3,4,2,3,4,5,6,7],
                   'price':[10,10.2,10.8,10.1,10.4,142.1,142.8,143.6,142.8,141.4,140.7]})

The df looks like this

Out[65]: 
    id  t  price
0    1  0   10.0
1    1  1   10.2
2    1  2   10.8
3    1  3   10.1
4    1  4   10.4
5    2  2  142.1
6    2  3  142.8
7    2  4  143.6
8    2  5  142.8
9    2  6  141.4
10   2  7  140.7

For the time series with id 1, the missing timestamps are 5,6 and 7 and the time series #2 misses timestamps 0 and 1.

I would like to fill the missing dates for all the time series in the dataframe so all of them have all the dates filled with nan:

    df_target = pd.DataFrame({'id':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
                              't':[0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7],
                              'price':[10,10.2,10.8,10.1,10.4,np.nan,np.nan,np.nan,np.nan,np.nan,142.1,142.8,143.6,142.8,141.4,140.7]})

Out[68]: 
    id  t  price
0    1  0   10.0
1    1  1   10.2
2    1  2   10.8
3    1  3   10.1
4    1  4   10.4
5    1  5    NaN
6    1  6    NaN
7    1  7    NaN
8    2  0    NaN
9    2  1    NaN
10   2  2  142.1
11   2  3  142.8
12   2  4  143.6
13   2  5  142.8
14   2  6  141.4
15   2  7  140.7

The objective is to be able to then reshape this dataframe to a 3d array. Is there a simple way to fill missing dates for each time series? Thanks

Upvotes: 1

Views: 196

Answers (2)

sammywemmy
sammywemmy

Reputation: 28644

One option is to use the complete function from pyjanitor to expose the implicitly missing rows;

# pip install pyjanitor
import pandas as pd
import janitor

df.complete("id", {"t": range(7)}, sort=True)
 
    id  t  price
0    1  0   10.0
1    1  1   10.2
2    1  2   10.8
3    1  3   10.1
4    1  4   10.4
5    1  5    NaN
6    1  6    NaN
7    2  0    NaN
8    2  1    NaN
9    2  2  142.1
10   2  3  142.8
11   2  4  143.6
12   2  5  142.8
13   2  6  141.4
14   2  7  140.7

In the code above, complete combines id with a range of 0 to 7 for t to build a new dataframe, and expose the missing rows.

Upvotes: 1

jezrael
jezrael

Reputation: 862481

Use Series.unstack with DataFrame.stack:

df1 = (df.set_index(['id','t'])['price']
         .unstack()
         .stack(dropna=False)
         .reset_index(name='price'))
print (df1)
    id  t  price
0    1  0   10.0
1    1  1   10.2
2    1  2   10.8
3    1  3   10.1
4    1  4   10.4
5    1  5    NaN
6    1  6    NaN
7    1  7    NaN
8    2  0    NaN
9    2  1    NaN
10   2  2  142.1
11   2  3  142.8
12   2  4  143.6
13   2  5  142.8
14   2  6  141.4
15   2  7  140.7

Upvotes: 6

Related Questions