Pandas - Interpolating/imputing missing values within groups of multiple time series

I'm working with a dataset which has monthly information about several users. And each user has a different time range. There is also missing data for each user. What I would like to do is fill in the missing data for each user based on the time range for each user(from min.time to max.time in months).

Since, the data pattern falls into a time series, linear interpolation would not make sense. I had also set a multiindex for the data frame based on 'user' and 'date', but 'time' based interpolation doesn't work(as it's not yet been implemented)

x = pd.DataFrame({'user': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','c','c','c','c','c','c','c','c'],'dt':['2015-01-01','2015-02-01','2015-03-01','2015-04-01','2015-05-01','2015-06-01','2015-07-01','2015-08-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01','2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01','2017-06-01','2017-07-01','2017-08-01'], 'val': [1,33,np.nan,1,np.nan,4,2,np.nan,66,2,5,1,np.nan,np.nan,7,5,np.nan,1,np.nan,7,4,np.nan,5,3,np.nan]})

user         dt     val
0     a 2015-01-01    1
1     a 2015-02-01   33
2     a 2015-03-01  NaN
3     a 2015-04-01    1
4     a 2015-05-01  NaN
5     a 2015-06-01    4
6     a 2015-07-01    2
7     a 2015-08-01  NaN
8     b 2016-01-01   66
9     b 2016-02-01    2
10    b 2016-03-01    5
11    b 2016-04-01    1
12    b 2016-05-01  NaN
13    b 2016-06-01  NaN
14    b 2016-07-01    7
15    b 2016-08-01    5
16    b 2016-09-01  NaN
17    c 2017-01-01    1
18    c 2017-02-01  NaN
19    c 2017-03-01    7
20    c 2017-04-01    4
21    c 2017-05-01  NaN
22    c 2017-06-01    5
23    c 2017-07-01    3
24    c 2017-08-01  NaN

In the above dataset, 'val' column has several missing values and I'm figuring out how to automate the process of doing this for several 'users'.

Also, will building a time series model for each user make sense given the no. of data points for each user?

Any input/workaround would be much appreciated.

Thanks, Luc.

Upvotes: 3

Answers (2)

ansev

Reputation: 30930

You could use Groupby + apply to fill in the missing values depending on the user. Without the need to create a series for each user.

Here is an example of how you could fill in this missing data( In addition to these methods you can also create your own function):

1.interpolating for each group with interpolate.Filling with fillna

x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna(x.interpolate()))
print(x)

   user          dt   val
0     a  2015-01-01   1.0
1     a  2015-02-01  33.0
2     a  2015-03-01  17.0
3     a  2015-04-01   1.0
4     a  2015-05-01   2.5
5     a  2015-06-01   4.0
6     a  2015-07-01   2.0
7     a  2015-08-01   2.0
8     b  2016-01-01  66.0
9     b  2016-02-01   2.0
10    b  2016-03-01   5.0
11    b  2016-04-01   1.0
12    b  2016-05-01   3.0
13    b  2016-06-01   5.0
14    b  2016-07-01   7.0
15    b  2016-08-01   5.0
16    b  2016-09-01   5.0
17    c  2017-01-01   1.0
18    c  2017-02-01   4.0
19    c  2017-03-01   7.0
20    c  2017-04-01   4.0
21    c  2017-05-01   4.5
22    c  2017-06-01   5.0
23    c  2017-07-01   3.0
24    c  2017-08-01   3.0

2. ffill method:

x['val']=x.groupby('user')['val'].apply(lambda x: x.ffill())

   user          dt   val
0     a  2015-01-01   1.0
1     a  2015-02-01  33.0
2     a  2015-03-01  33.0
3     a  2015-04-01   1.0
4     a  2015-05-01   1.0
5     a  2015-06-01   4.0
6     a  2015-07-01   2.0
7     a  2015-08-01   2.0
8     b  2016-01-01  66.0
9     b  2016-02-01   2.0
10    b  2016-03-01   5.0
11    b  2016-04-01   1.0
12    b  2016-05-01   1.0
13    b  2016-06-01   1.0
14    b  2016-07-01   7.0
15    b  2016-08-01   5.0
16    b  2016-09-01   5.0
17    c  2017-01-01   1.0
18    c  2017-02-01   1.0
19    c  2017-03-01   7.0
20    c  2017-04-01   4.0
21    c  2017-05-01   4.0
22    c  2017-06-01   5.0
23    c  2017-07-01   3.0
24    c  2017-08-01   3.0

3.Using mean of group. Filling with fillna:

x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna(x.mean()))
print(x)

   user          dt        val
0     a  2015-01-01   1.000000
1     a  2015-02-01  33.000000
2     a  2015-03-01   8.200000
3     a  2015-04-01   1.000000
4     a  2015-05-01   8.200000
5     a  2015-06-01   4.000000
6     a  2015-07-01   2.000000
7     a  2015-08-01   8.200000
8     b  2016-01-01  66.000000
9     b  2016-02-01   2.000000
10    b  2016-03-01   5.000000
11    b  2016-04-01   1.000000
12    b  2016-05-01  14.333333
13    b  2016-06-01  14.333333
14    b  2016-07-01   7.000000
15    b  2016-08-01   5.000000
16    b  2016-09-01  14.333333
17    c  2017-01-01   1.000000
18    c  2017-02-01   4.000000
19    c  2017-03-01   7.000000
20    c  2017-04-01   4.000000
21    c  2017-05-01   4.000000
22    c  2017-06-01   5.000000
23    c  2017-07-01   3.000000
24    c  2017-08-01   4.000000

Upvotes: 3

Giacomo Sachs

Reputation: 231

That's how I would do.

The assumption is that you know how you want to infer the missing values.

Then I would create the function you use to impute the data, it should be something like that:

def f(x):
    # x is a DataFrame with columns [datetime, value]
    # ...
    # ...
    output = ...
    # output is a dataframe with columns [datetime, value],
    # where value is a column with the data imputed (so without nans)
    return output

then you can just do:

x.groupby(user).apply(f)

Upvotes: 0

Pandas - Interpolating/imputing missing values within groups of multiple time series

Answers (2)

Related Questions