Reputation: 1539
I have the following pandas DataFrame:
Input
import pandas as pd

df = pd.DataFrame({
    'DeviceID': ['358639059721158', '358639059721158', '358639059721158', '358639059721158', '358639059721158', '358639059721304', '358639059721304'],
    'AssetSerialNo': ['BKK-XYZ00005', 'BKK-XYZ00005', 'BKK-XYZ00005', 'BKK-XYZ00005', 'BKK-XYZ00005', 'IST-XYZ00004', 'IST-XYZ00004'],
    'Date': ['9/5/2018 6:00', '9/5/2018 6:01', '9/5/2018 6:59', '9/5/2018 7:04', '9/5/2018 8:54', '28/5/2018 23:40', '28/5/2018 23:59'],
    'ParameterValue': [0.475, 0.375, 0.175, 0.575, 0.115, 0.234, 0.575],
    'Tc': [-18.000000, -16.000000, -14.000000, -12.000000, -11.000000, -29.000000, -28.000000],
    'compressor': [0, 0, 1, 1, 1, 1, 1],
    'Ta': [31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826],
    'label': [0, 0, 0, 1, 0, 0, 0],
    'FaultType_Rcap': [0, 0, 0, 0, 0, 0, 0],
    'FaultType_Rec': [0, 0, 0, 0, 0, 0, 0],
    'FaultType_Ri': [0, 0, 0, 0, 0, 0, 0],
    'FaultType_normal': [1, 1, 1, 1, 1, 1, 1]})
(I added this dummy DataFrame in case you want to replicate the task.)
As you can see, the 'Date' column holds timestamps roughly one minute apart. However, a device does not necessarily produce 60 rows of data per hour (1 hour = 60 min): device "358639059721304", for example, sent only two rows between 23:00 and 00:00.
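Since 'Date' is stored as strings, I assume the first step would be converting it to real datetimes (the format is day-first). Something like this, unless I am mistaken:

df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M')  # day-first timestamps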
What I would like to do is to group the rows of the DataFrame by 1-hour interval per device. So, for example, I would like something like the following:
Output
| DeviceID | AssetSerialNo | Date | ParameterValue | Tc | compressor | Ta | label | FaultType_Rcap | FaultType_Rcond | FaultType_Rec | FaultType_Ri | FaultType_normal |
|-----------------|---------------|-----------------|-----------------------|--------------------------------------|------------|-----------------------------------|-------|----------------|-----------------|---------------|--------------|------------------|
| 358639059721158 | BKK-XYZ00005 | 9/5/2018 6:00 | (0.475+0.325+0.525)/3 | (-18.000000+-16.720951+-11.720951)/3 | (0+0+1)/3 | (24.742001+20.742001+35.742001)/3 | 1 | 0 | 0 | 0 | 0 | 1 |
| 358639059721158 | BKK-XYZ00005 | 9/5/2018 7:04 | (0.225+0.105)/2 | (-0.720951+-22.720951)/2 | (1+1)/2 | (5.742001+15.742001)/2 | 1 | 0 | 0 | 0 | 0 | 1 |
| 358639059721304 | IST-XYZ00004 | 28/5/2018 23:40 | (0.127544+0.097544)/2 | (-11.720951+-15.288659)/2 | (0+0)/2 | (31.742001+11.742001)/2 | 1 | 0 | 0 | 0 | 0 | 1 |
So the output groups the data into 1-hour intervals and takes the average of the remaining columns. My goal is to downsample the data: in total, the 1,000,000 rows should shrink to far fewer by grouping them by 1-hour interval per device. I would like your thoughts on doing this with a pandas DataFrame, although I would also appreciate a PySpark approach. But the pandas solution comes first.
My current approach (not complete)
from datetime import datetime, timedelta

date_time_obj = datetime.strptime(dataframe_classification['Date'].iloc[0], '%d/%m/%Y %H:%M')  # parse the first "Date" value into a datetime
[(date_time_obj + timedelta(minutes=x)).strftime('%M') for x in range(60)]  # minute marks of the following hour
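To make the question more concrete, my rough (untested) guess is that the pandas side would involve grouping on the device columns plus pd.Grouper on the hour and taking the mean of the other columns, something like this on the dummy frame above:

hourly = (df
          .groupby(['DeviceID', 'AssetSerialNo', pd.Grouper(key='Date', freq='H')])
          .mean()
          .reset_index())

I am not sure this is the right or idiomatic way, though (for one thing it floors each group to the start of the hour instead of keeping the first timestamp, as in my example output). On the PySpark side my guess, equally untested, would be something with the window function:

from pyspark.sql import functions as F

sdf = spark.createDataFrame(df)  # assuming an existing SparkSession named `spark`
hourly_sdf = (sdf
              .groupBy('DeviceID', 'AssetSerialNo', F.window('Date', '1 hour'))
              .mean())

Corrections to either sketch are welcome.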
Thank you in advance for any help and advice.
Upvotes: 1
Views: 473