Reputation: 25366
I have the following data frame df
:
time col_A
0 1520582580.000 79.000
1 1520582880.000 22.500
2 1520583180.000 29.361
3 1520583480.000 116.095
4 1520583780.000 19.972
5 1520584080.000 36.857
6 1520584380.000 15.167
7 1520584680.000 nan
8 1520584980.000 nan
9 1520585280.000 nan
10 1520585580.000 34.500
11 1520585880.000 17.583
12 1520586180.000 nan
13 1520586480.000 48.833
14 1520586780.000 18.806
15 1520587080.000 18.583
col_A
has some missing data. I would like to create a col_B
, which takes the previous value for each missing record. i.e.
6 1520584380.000 15.167
7 1520584680.000 15.167
8 1520584980.000 15.167
9 1520585280.000 15.167
10 1520585580.000 34.500
11 1520585880.000 17.583
12 1520586180.000 17.583
13 1520586480.000 48.833
and a col_C
, which interpolate using the closest before and after non-missing points. i.e.
6 1520584380.000 15.167
7 1520584680.000 20.001
8 1520584980.000 24.834
9 1520585280.000 29.667
10 1520585580.000 34.500
11 1520585880.000 17.583
12 1520586180.000 33.208
13 1520586480.000 48.833
Other than loop over the data frame to do the computation record by record, is there a built-in function that I can use to achieve this in an elegant way? Thanks!
Upvotes: 5
Views: 2322
Reputation: 862851
I think need ffill
with interpolate
:
df['colB'] = df['col_A'].ffill()
df['colc'] = df['col_A'].interpolate()
print (df)
time col_A colB colc
0 1.520583e+09 79.000 79.000 79.00000
1 1.520583e+09 22.500 22.500 22.50000
2 1.520583e+09 29.361 29.361 29.36100
3 1.520583e+09 116.095 116.095 116.09500
4 1.520584e+09 19.972 19.972 19.97200
5 1.520584e+09 36.857 36.857 36.85700
6 1.520584e+09 15.167 15.167 15.16700
7 1.520585e+09 NaN 15.167 20.00025
8 1.520585e+09 NaN 15.167 24.83350
9 1.520585e+09 NaN 15.167 29.66675
10 1.520586e+09 34.500 34.500 34.50000
11 1.520586e+09 17.583 17.583 17.58300
12 1.520586e+09 NaN 17.583 33.20800
13 1.520586e+09 48.833 48.833 48.83300
14 1.520587e+09 18.806 18.806 18.80600
15 1.520587e+09 18.583 18.583 18.58300
If want use method time
for interpolate:
df['time'] = pd.to_datetime(df['time'], unit='s')
df = df.set_index('time')
df['colB'] = df['col_A'].ffill()
df['colc'] = df['col_A'].interpolate('time')
print (df)
col_A colB colc
time
2018-03-09 08:03:00 79.000 79.000 79.00000
2018-03-09 08:08:00 22.500 22.500 22.50000
2018-03-09 08:13:00 29.361 29.361 29.36100
2018-03-09 08:18:00 116.095 116.095 116.09500
2018-03-09 08:23:00 19.972 19.972 19.97200
2018-03-09 08:28:00 36.857 36.857 36.85700
2018-03-09 08:33:00 15.167 15.167 15.16700
2018-03-09 08:38:00 NaN 15.167 20.00025
2018-03-09 08:43:00 NaN 15.167 24.83350
2018-03-09 08:48:00 NaN 15.167 29.66675
2018-03-09 08:53:00 34.500 34.500 34.50000
2018-03-09 08:58:00 17.583 17.583 17.58300
2018-03-09 09:03:00 NaN 17.583 33.20800
2018-03-09 09:08:00 48.833 48.833 48.83300
2018-03-09 09:13:00 18.806 18.806 18.80600
2018-03-09 09:18:00 18.583 18.583 18.58300
Upvotes: 5