The Great
The Great

Reputation: 7693

pandas group by and fill in the missing time interval sequence

I have a data frame like as shown below

df = pd.DataFrame({'person_id': [11,11,11,21,21,21,31,31,31,31,31],
                   'time' :[-1,5,17,11,25,39,46,4,100,150,1],
                   'value':[101,102,121,120,143,153,160,170,96,97,99]})

What I would like to do is

a) FIll in the missing time by generating a sequence number (ex:1,2,3,4) and copy the value (for all other columns) from the previous row

I was trying something like below

df.groupby(['person_id']).cumcount() + 1
df['sequence'] = g.cumcount() + 1

But this doesn't help me get the expected output

I expect my output to be like as shown below (SAMPLE OF 1 SUBJECT IS SHOWN BELOW)

enter image description here

Upvotes: 3

Views: 803

Answers (3)

sammywemmy
sammywemmy

Reputation: 28644

the complete function from pyjanitor can help with missing rows; it can handle duplicates as well :

#pip install pyjanitor
import pandas as pd
import janitor

# build a dictionary
# pairing time with range of min and max time
time = {'time' : lambda df: range(df.min(), df.max()+1)}

# complete the rows
# the expansion is done per `person_id`
outcome = (df.complete(time, by='person_id', sort = True)
             .ffill(downcast='infer'))
 
outcome
     person_id  time  value
0           11    -1    101
1           11     0    101
2           11     1    101
3           11     2    101
4           11     3    101
..         ...   ...    ...
193         31   146     96
194         31   147     96
195         31   148     96
196         31   149     96
197         31   150     97


outcome.query('person_id == 11')
    person_id  time  value
0          11    -1    101
1          11     0    101
2          11     1    101
3          11     2    101
4          11     3    101
5          11     4    101
6          11     5    102
7          11     6    102
8          11     7    102
9          11     8    102
10         11     9    102
11         11    10    102
12         11    11    102
13         11    12    102
14         11    13    102
15         11    14    102
16         11    15    102
17         11    16    102
18         11    17    121

Upvotes: 2

Mark Wang
Mark Wang

Reputation: 2757

The key is to 'reindex' per group min/max and fill forward. See one of the approaches below,

(df.groupby('person_id')['time']
   .apply(lambda x:np.arange(x.min(), x.max()))
   .explode()
   .reset_index()
   .merge(df, how='left')
   .ffill()
)

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71689

Let's set the time column as the index of dataframe then groupby the dataframe on person_id then for each group classified by person_id reindex the group to conform its index with the range of values specified in time column, finally concat all the groups to get the desired dataframe:

grp = df.set_index('time').groupby('person_id')
groups = [g.reindex(range(g.index.min(), g.index.max() + 1)).ffill().reset_index() for _, g in grp]
out = pd.concat(groups, ignore_index=True).reindex(df.columns, axis=1)

Alternatively you can first create tuple pairs for each person_id and corresponding range of values specified in time column then reindex the dataframe:

grp = df.groupby('person_id')['time']
idx = [(k, n) for k, t in grp  for n in range(t.min(), t.max() + 1)]
out = df.set_index(['person_id', 'time']).reindex(idx).ffill().reset_index()

Result (for person_id 11):

    person_id  time  value
0        11.0    -1  101.0
1        11.0     0  101.0
2        11.0     1  101.0
3        11.0     2  101.0
4        11.0     3  101.0
5        11.0     4  101.0
6        11.0     5  102.0
7        11.0     6  102.0
8        11.0     7  102.0
9        11.0     8  102.0
10       11.0     9  102.0
11       11.0    10  102.0
12       11.0    11  102.0
13       11.0    12  102.0
14       11.0    13  102.0
15       11.0    14  102.0
16       11.0    15  102.0
17       11.0    16  102.0
18       11.0    17  121.0

Upvotes: 4

Related Questions