Reputation: 7693
I have a data frame like the one shown below
import pandas as pd
df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21, 21, 31, 31, 31, 31, 31],
                   'time': [-1, 5, 17, 11, 25, 39, 46, 4, 100, 150, 1],
                   'value': [101, 102, 121, 120, 143, 153, 160, 170, 96, 97, 99]})
What I would like to do is:
a) Fill in the missing time values by generating the in-between sequence (ex: 1, 2, 3, 4) and copy the value (for all other columns) from the previous row
I was trying something like this
g = df.groupby(['person_id'])
df['sequence'] = g.cumcount() + 1
But this doesn't help me get the expected output.
I expect my output to be like the one shown below (a sample for one subject is shown).
Upvotes: 3
Views: 803
Reputation: 28644
The complete function from pyjanitor can help with the missing rows; it can handle duplicates as well:
# pip install pyjanitor
import pandas as pd
import janitor

# build a dictionary pairing `time` with the range
# between its minimum and maximum values
time = {'time': lambda df: range(df.min(), df.max() + 1)}

# complete the rows;
# the expansion is done per `person_id`
outcome = (df.complete(time, by='person_id', sort=True)
             .ffill(downcast='infer'))
outcome
person_id time value
0 11 -1 101
1 11 0 101
2 11 1 101
3 11 2 101
4 11 3 101
.. ... ... ...
193 31 146 96
194 31 147 96
195 31 148 96
196 31 149 96
197 31 150 97
outcome.query('person_id == 11')
person_id time value
0 11 -1 101
1 11 0 101
2 11 1 101
3 11 2 101
4 11 3 101
5 11 4 101
6 11 5 102
7 11 6 102
8 11 7 102
9 11 8 102
10 11 9 102
11 11 10 102
12 11 11 102
13 11 12 102
14 11 13 102
15 11 14 102
16 11 15 102
17 11 16 102
18 11 17 121
Upvotes: 2
Reputation: 2757
The key is to 'reindex' over each group's min/max time range and fill forward. See one such approach below:
import numpy as np

(df.groupby('person_id')['time']
   .apply(lambda x: np.arange(x.min(), x.max() + 1))  # full time range per person, max included
   .explode()
   .reset_index()
   .merge(df, how='left')
   .ffill()
)
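For example, to spot-check a single subject, the same chain can be assigned to a name first (out is just an illustrative name here) and filtered the way the pyjanitor answer does:
out = (df.groupby('person_id')['time']
         .apply(lambda x: np.arange(x.min(), x.max() + 1))
         .explode()
         .reset_index()
         .merge(df, how='left')
         .ffill())  # "out" is only an illustrative name for the chained result

out.query('person_id == 11')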
Upvotes: 1
Reputation: 71689
Let's set the time column as the index of the dataframe, then groupby the dataframe on person_id; for each group classified by person_id, reindex the group to conform its index with the range of values spanned by the time column, and finally concat all the groups to get the desired dataframe:
grp = df.set_index('time').groupby('person_id')
groups = [g.reindex(range(g.index.min(), g.index.max() + 1)).ffill().reset_index()
          for _, g in grp]
out = pd.concat(groups, ignore_index=True).reindex(df.columns, axis=1)
Alternatively, you can first create tuple pairs of each person_id and the corresponding range of values spanned by the time column, then reindex the dataframe:
grp = df.groupby('person_id')['time']
idx = [(k, n) for k, t in grp for n in range(t.min(), t.max() + 1)]
out = df.set_index(['person_id', 'time']).reindex(idx).ffill().reset_index()
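Either out above can then be filtered per subject, e.g. with the same query pattern used in the pyjanitor answer:
out.query('person_id == 11')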
Result (for person_id 11):
person_id time value
0 11.0 -1 101.0
1 11.0 0 101.0
2 11.0 1 101.0
3 11.0 2 101.0
4 11.0 3 101.0
5 11.0 4 101.0
6 11.0 5 102.0
7 11.0 6 102.0
8 11.0 7 102.0
9 11.0 8 102.0
10 11.0 9 102.0
11 11.0 10 102.0
12 11.0 11 102.0
13 11.0 12 102.0
14 11.0 13 102.0
15 11.0 14 102.0
16 11.0 15 102.0
17 11.0 16 102.0
18 11.0 17 121.0
Upvotes: 4