Reputation: 43
I'm working with a dataset which has monthly information about several users. And each user has a different time range. There is also missing data for each user. What I would like to do is fill in the missing data for each user based on the time range for each user(from min.time to max.time in months).
Since, the data pattern falls into a time series, linear interpolation would not make sense. I had also set a multiindex for the data frame based on 'user' and 'date', but 'time' based interpolation doesn't work(as it's not yet been implemented)
x = pd.DataFrame({'user': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','c','c','c','c','c','c','c','c'],'dt':['2015-01-01','2015-02-01','2015-03-01','2015-04-01','2015-05-01','2015-06-01','2015-07-01','2015-08-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01','2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01','2017-06-01','2017-07-01','2017-08-01'], 'val': [1,33,np.nan,1,np.nan,4,2,np.nan,66,2,5,1,np.nan,np.nan,7,5,np.nan,1,np.nan,7,4,np.nan,5,3,np.nan]})
user dt val
0 a 2015-01-01 1
1 a 2015-02-01 33
2 a 2015-03-01 NaN
3 a 2015-04-01 1
4 a 2015-05-01 NaN
5 a 2015-06-01 4
6 a 2015-07-01 2
7 a 2015-08-01 NaN
8 b 2016-01-01 66
9 b 2016-02-01 2
10 b 2016-03-01 5
11 b 2016-04-01 1
12 b 2016-05-01 NaN
13 b 2016-06-01 NaN
14 b 2016-07-01 7
15 b 2016-08-01 5
16 b 2016-09-01 NaN
17 c 2017-01-01 1
18 c 2017-02-01 NaN
19 c 2017-03-01 7
20 c 2017-04-01 4
21 c 2017-05-01 NaN
22 c 2017-06-01 5
23 c 2017-07-01 3
24 c 2017-08-01 NaN
In the above dataset, 'val' column has several missing values and I'm figuring out how to automate the process of doing this for several 'users'.
Also, will building a time series model for each user make sense given the no. of data points for each user?
Any input/workaround would be much appreciated.
Thanks, Luc.
Upvotes: 3
Views: 1939
Reputation: 30920
You could use Groupby + apply to fill in the missing values depending on the user. Without the need to create a series for each user.
Here is an example of how you could fill in this missing data( In addition to these methods you can also create your own function):
1.interpolating for each group with interpolate.Filling with fillna
x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna(x.interpolate()))
print(x)
user dt val
0 a 2015-01-01 1.0
1 a 2015-02-01 33.0
2 a 2015-03-01 17.0
3 a 2015-04-01 1.0
4 a 2015-05-01 2.5
5 a 2015-06-01 4.0
6 a 2015-07-01 2.0
7 a 2015-08-01 2.0
8 b 2016-01-01 66.0
9 b 2016-02-01 2.0
10 b 2016-03-01 5.0
11 b 2016-04-01 1.0
12 b 2016-05-01 3.0
13 b 2016-06-01 5.0
14 b 2016-07-01 7.0
15 b 2016-08-01 5.0
16 b 2016-09-01 5.0
17 c 2017-01-01 1.0
18 c 2017-02-01 4.0
19 c 2017-03-01 7.0
20 c 2017-04-01 4.0
21 c 2017-05-01 4.5
22 c 2017-06-01 5.0
23 c 2017-07-01 3.0
24 c 2017-08-01 3.0
2. ffill method:
x['val']=x.groupby('user')['val'].apply(lambda x: x.ffill())
user dt val
0 a 2015-01-01 1.0
1 a 2015-02-01 33.0
2 a 2015-03-01 33.0
3 a 2015-04-01 1.0
4 a 2015-05-01 1.0
5 a 2015-06-01 4.0
6 a 2015-07-01 2.0
7 a 2015-08-01 2.0
8 b 2016-01-01 66.0
9 b 2016-02-01 2.0
10 b 2016-03-01 5.0
11 b 2016-04-01 1.0
12 b 2016-05-01 1.0
13 b 2016-06-01 1.0
14 b 2016-07-01 7.0
15 b 2016-08-01 5.0
16 b 2016-09-01 5.0
17 c 2017-01-01 1.0
18 c 2017-02-01 1.0
19 c 2017-03-01 7.0
20 c 2017-04-01 4.0
21 c 2017-05-01 4.0
22 c 2017-06-01 5.0
23 c 2017-07-01 3.0
24 c 2017-08-01 3.0
3.Using mean of group. Filling with fillna:
x['val']=x.groupby('user')['val'].apply(lambda x: x.fillna(x.mean()))
print(x)
user dt val
0 a 2015-01-01 1.000000
1 a 2015-02-01 33.000000
2 a 2015-03-01 8.200000
3 a 2015-04-01 1.000000
4 a 2015-05-01 8.200000
5 a 2015-06-01 4.000000
6 a 2015-07-01 2.000000
7 a 2015-08-01 8.200000
8 b 2016-01-01 66.000000
9 b 2016-02-01 2.000000
10 b 2016-03-01 5.000000
11 b 2016-04-01 1.000000
12 b 2016-05-01 14.333333
13 b 2016-06-01 14.333333
14 b 2016-07-01 7.000000
15 b 2016-08-01 5.000000
16 b 2016-09-01 14.333333
17 c 2017-01-01 1.000000
18 c 2017-02-01 4.000000
19 c 2017-03-01 7.000000
20 c 2017-04-01 4.000000
21 c 2017-05-01 4.000000
22 c 2017-06-01 5.000000
23 c 2017-07-01 3.000000
24 c 2017-08-01 4.000000
Upvotes: 3
Reputation: 231
That's how I would do.
The assumption is that you know how you want to infer the missing values.
Then I would create the function you use to impute the data, it should be something like that:
def f(x):
# x is a DataFrame with columns [datetime, value]
# ...
# ...
output = ...
# output is a dataframe with columns [datetime, value],
# where value is a column with the data imputed (so without nans)
return output
then you can just do:
x.groupby(user).apply(f)
Upvotes: 0