Reputation: 43
uuid site ts visit
0 +CW99 1124 2013-06-24 2
1 +CW99 1124 2013-06-26 1
2 +CW99 1124 2013-06-27 1
3 +CW99 1124 2013-06-20 1
4 +CW99 1124 2013-06-21 1
5 +CW99 1124 2013-06-24 2
6 +CW9W 956 2013-06-21 4
7 +CW9W 956 2013-06-22 2
8 +CW9W 956 2013-06-23 3
9 +CW9W 956 2013-06-24 4
I have a dataframe that is passed in looks like that. I don't really know how many distinct ts period are there. So I want to find the min and max of the ts then expands the rows that does not have all the ts (daily) to zero fill the visit column while copying the rest of the fields
uuid site ts visit
0 +CW99 1124 2013-06-20 1
1 +CW99 1124 2013-06-21 1
2 +CW99 1124 2013-06-22 1
3 +CW99 1124 2013-06-23 0
4 +CW99 1124 2013-06-24 2
5 +CW99 1124 2013-06-25 0
6 +CW99 1124 2013-06-26 1
7 +CW99 1124 2013-06-27 1
8 +CW9W 956 2013-06-20 0
9 +CW9W 956 2013-06-21 4
10 +CW9W 956 2013-06-22 2
11 +CW9W 956 2013-06-23 3
12 +CW9W 956 2013-06-24 4
13 +CW9W 956 2013-06-25 0
14 +CW9W 956 2013-06-26 0
15 +CW9W 956 2013-06-27 0
Thanks.
Upvotes: 4
Views: 750
Reputation: 128918
This is non-trivial. I'll explain why below
Prelim, read in the original data frame and make sure that the ts
column is a dtype of datetime64[ns]
# you may need to do this to get the correct dtype
df['ts'] = df['ts'].to_datetime(df['ts'])
In [107]: df
Out[107]:
uuid site ts visit
0 +CW99 1124 2013-06-24 00:00:00 2
1 +CW99 1124 2013-06-26 00:00:00 1
2 +CW99 1124 2013-06-27 00:00:00 1
3 +CW99 1124 2013-06-20 00:00:00 1
4 +CW99 1124 2013-06-21 00:00:00 1
5 +CW99 1124 2013-06-24 00:00:00 2
6 +CW9W 956 2013-06-21 00:00:00 4
7 +CW9W 956 2013-06-22 00:00:00 2
8 +CW9W 956 2013-06-23 00:00:00 3
9 +CW9W 956 2013-06-24 00:00:00 4
In [106]: df.dtypes
Out[106]:
uuid object
site int64
ts datetime64[ns]
visit int64
dtype: object
Create the master times between min and max
In [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())
In [111]: all_ts
Out[111]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-20 00:00:00, ..., 2013-06-27 00:00:00]
Length: 8, Freq: D, Timezone: None
Define a function like this
In [103]: def f(x):
# we want all of the ``ts`` column that are not in the master time series
.....: adf = DataFrame(dict(ts = all_ts-Index(x['ts'])),columns=df.columns)
# they should have visit of 0
.....: adf['visit'] = 0
# first add them to the frame (x), ignoring the index
# sort by the ts column
# then fillforward missing values
.....: return x.append(adf,ignore_index=True).sort_index(by='ts').ffill()
.....:
Apply the function (you could group by uuid,site if you want as well)
In [116]: df.groupby('uuid').apply(f)
Out[116]:
uuid site ts visit
uuid
+CW99 3 +CW99 1124 2013-06-20 00:00:00 1
4 +CW99 1124 2013-06-21 00:00:00 1
0 +CW99 1124 2013-06-24 00:00:00 2
5 +CW99 1124 2013-06-24 00:00:00 2
6 +CW99 1124 2013-06-25 00:00:00 0
1 +CW99 1124 2013-06-26 00:00:00 1
2 +CW99 1124 2013-06-27 00:00:00 1
+CW9W 0 +CW9W 956 2013-06-21 00:00:00 4
1 +CW9W 956 2013-06-22 00:00:00 2
2 +CW9W 956 2013-06-23 00:00:00 3
3 +CW9W 956 2013-06-24 00:00:00 4
4 +CW9W 956 2013-06-25 00:00:00 0
Note: you have a duplicate in the posted frame. Not sure if that was intential or not
I preserved that. This is a somewhat easier problem if you don't have duplicates (in the ts
column)
Here is the no dup way
In [207]: def f(x):
.....: x = x.set_index('ts').reindex(all_ts).reset_index()
.....: x['visit'] = x['visit'].fillna(0)
.....: return x.ffill()
.....:
In [208]: df_no_dups.groupby('uuid').apply(f)
Out[208]:
index uuid site visit
uuid
+CW99 0 2013-06-20 00:00:00 +CW99 1124 1
1 2013-06-21 00:00:00 +CW99 1124 1
2 2013-06-22 00:00:00 +CW99 1124 0
3 2013-06-23 00:00:00 +CW99 1124 0
4 2013-06-24 00:00:00 +CW99 1124 2
5 2013-06-25 00:00:00 +CW99 1124 0
6 2013-06-26 00:00:00 +CW99 1124 1
7 2013-06-27 00:00:00 +CW99 1124 1
+CW9W 0 2013-06-20 00:00:00 NaN NaN 0
1 2013-06-21 00:00:00 +CW9W 956 4
2 2013-06-22 00:00:00 +CW9W 956 2
3 2013-06-23 00:00:00 +CW9W 956 3
4 2013-06-24 00:00:00 +CW9W 956 4
5 2013-06-25 00:00:00 +CW9W 956 0
6 2013-06-26 00:00:00 +CW9W 956 0
7 2013-06-27 00:00:00 +CW9W 956 0
This forces all the elements to be there (note the NaN
becuase no way to ffill
on the first element). You could drop these if you want.
Upvotes: 4