Reputation: 3028
I have a CSV file which contains some data:
,date,location,device,provider,cpu,mem,load,drops,id,latency,gw_latency,upload,download,sap_drops,sap_latency,alert_id
389,2018-02-13 09:20:17.572685+00:00,ASA,10.11.100.1,BOM,4.0,23.0,0.25,0.0,,,,,,,,
390,2018-02-13 09:20:21.836284+00:00,ASA,10.11.100.1,COD,4.0,23.0,2.08,0.0,,,,,,,,
391,2018-02-13 09:30:59.401178+00:00,ASA,10.11.100.1,COD,5.0,23.0,8.0,0.0,,,,,,,,
392,2018-02-13 09:31:03.667730+00:00,ASA,10.11.100.1,COD,5.0,23.0,3.5,0.0,,,,,,,,
393,2018-02-13 09:41:14.666626+00:00,ASA,10.11.100.1,BOM,4.0,23.0,0.5,0.0,,,,,,,,
394,2018-02-13 09:41:18.935061+00:00,ASA,10.11.100.1,DAE,4.0,23.0,3.0,0.0,,,,,,,,
395,2018-02-13 09:50:17.491014+00:00,ASA,10.11.100.1,DAE,5.0,23.0,8.25,0.0,,,,,,,,
396,2018-02-13 09:50:21.751805+00:00,BBB,10.11.100.1,BOM,5.0,23.0,2.75,0.0,,,,,,,,
397,2018-02-13 10:00:18.387647+00:00,BBB,10.11.100.1,CXU,5.0,23.0,2.0,0.0,,,,,,,,
398,2018-02-13 10:00:22.847626+00:00,ASA,10.11.100.1,BOM,5.0,23.0,3.17,0.0,,,,,,,,
399,2018-02-13 10:10:17.521642+00:00,BBB,10.11.100.1,DAE,5.0,23.0,1.0,0.0,,,,,,,,
400,2018-02-13 10:10:21.786720+00:00,BBB,10.11.100.1,DAE,5.0,23.0,2.42,0.0,,,,,,,,
401,2018-02-13 10:14:38.085999+00:00,BBB,10.11.100.1,CXU,4.0,23.0,0.25,0.0,,,,,,,,
...
As you can see, there are a lot of entries for the date 2018-02-13 across several time intervals. I want to bucket these into a 24-hour range where each hour holds one (averaged) value. This is what I did:
df_next = df.loc['2018-02-13'].resample('H')["cpu"].mean().fillna(0)
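(For reference: for df.loc['2018-02-13'] and resample('H') to work, the date column has to be a DatetimeIndex. I load the file roughly like this; the file name is just a placeholder:)
import pandas as pd

# "data.csv" is a placeholder; parse the date column (column 1)
# as the index so .loc['...'] and .resample('H') work on timestamps
df = pd.read_csv("data.csv", index_col=1, parse_dates=True)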
But for the date 2018-02-13, data was only collected up to 10:00 hrs (the last entry was made at 10:14:38), so it only gives me results up to that hour. For some other dates, if data was collected from 9:00 onwards, then I only get hourly intervals from 9:00 onwards.
How do I get a complete 24-hour range starting from 00:00 for a date, no matter at what time the data was collected? Basically it should assign 0 to the hours when no data was collected and the mean to the hours when data was collected.
So basically I want something like this:
380,2018-02-13 00:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
381,2018-02-13 01:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
382,2018-02-13 02:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
383,2018-02-13 03:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
384,2018-02-13 04:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
385,2018-02-13 05:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
386,2018-02-13 06:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
387,2018-02-13 07:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
388,2018-02-13 08:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
389,2018-02-13 09:20:17.572685+00:00,ASA,10.11.100.1,BOM,4.0,23.0,0.25,0.0,,,,,,,,
390,2018-02-13 09:20:21.836284+00:00,ASA,10.11.100.1,COD,4.0,23.0,2.08,0.0,,,,,,,,
391,2018-02-13 09:30:59.401178+00:00,ASA,10.11.100.1,COD,5.0,23.0,8.0,0.0,,,,,,,,
392,2018-02-13 09:31:03.667730+00:00,ASA,10.11.100.1,COD,5.0,23.0,3.5,0.0,,,,,,,,
393,2018-02-13 09:41:14.666626+00:00,ASA,10.11.100.1,BOM,4.0,23.0,0.5,0.0,,,,,,,,
394,2018-02-13 09:41:18.935061+00:00,ASA,10.11.100.1,DAE,4.0,23.0,3.0,0.0,,,,,,,,
395,2018-02-13 09:50:17.491014+00:00,ASA,10.11.100.1,DAE,5.0,23.0,8.25,0.0,,,,,,,,
396,2018-02-13 09:50:21.751805+00:00,BBB,10.11.100.1,BOM,5.0,23.0,2.75,0.0,,,,,,,,
397,2018-02-13 10:00:18.387647+00:00,BBB,10.11.100.1,CXU,5.0,23.0,2.0,0.0,,,,,,,,
398,2018-02-13 10:00:22.847626+00:00,ASA,10.11.100.1,BOM,5.0,23.0,3.17,0.0,,,,,,,,
399,2018-02-13 10:10:17.521642+00:00,BBB,10.11.100.1,DAE,5.0,23.0,1.0,0.0,,,,,,,,
400,2018-02-13 10:10:21.786720+00:00,BBB,10.11.100.1,DAE,5.0,23.0,2.42,0.0,,,,,,,,
401,2018-02-13 10:14:38.085999+00:00,BBB,10.11.100.1,CXU,4.0,23.0,0.25,0.0,,,,,,,,
402,2018-02-13 11:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
403,2018-02-13 12:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
404,2018-02-13 13:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
405,2018-02-13 14:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
406,2018-02-13 15:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
407,2018-02-13 16:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
408,2018-02-13 17:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
409,2018-02-13 18:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
410,2018-02-13 19:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
411,2018-02-13 20:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
412,2018-02-13 21:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
413,2018-02-13 22:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
414,2018-02-13 23:00:00.000000+00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
As you can see, it fills in the rest of the hours and keeps their values as 0.
Upvotes: 1
Views: 380
Reputation: 863246
Use:
import pandas as pd

# hourly means for the cpu column; drop the hours with no data at all
a = df.resample('H')["cpu"].mean().dropna()
# build all possible hours, from 00:00 of the first day
# up to 23:00 of the last day
rng = pd.date_range(a.index.min().floor('d'),
                    a.index.max().floor('d') + pd.Timedelta(23, unit='h'),
                    freq='H')
# the missing index values, i.e. the hours with no data
diff_idx = rng.difference(a.index)
# append zero-filled rows for the missing hours to the original
# DataFrame, then sort for correct ordering
df = pd.concat([df, pd.DataFrame(0, index=diff_idx, columns=df.columns)]).sort_index()
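If you only need the hourly means themselves rather than the extra rows merged back into the original DataFrame, a shorter sketch (assuming the same DatetimeIndex) is to reindex the resampled Series against the full hourly range:
# hourly means with 0 for every hour that had no data
hourly = df.resample('H')["cpu"].mean()
rng = pd.date_range(hourly.index.min().floor('d'),
                    hourly.index.max().floor('d') + pd.Timedelta(23, unit='h'),
                    freq='H')
hourly = hourly.reindex(rng).fillna(0)
Here reindex aligns the Series to the complete 24-hour grid, so hours outside the collected span come back as NaN and are then zeroed along with any empty bins inside it.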
Upvotes: 1