NikSp

Reputation: 1539

Get the average value of columns grouped by 1 hour time window

I have the following pandas DataFrame:

Input

df = pd.DataFrame(
{'DeviceID' : ['358639059721158', '358639059721158', '358639059721158', '358639059721158', '358639059721158', '358639059721304', '358639059721304'],
 'AssetSerialNo':['BKK-XYZ00005', 'BKK-XYZ00005', 'BKK-XYZ00005', 'BKK-XYZ00005', 'BKK-XYZ00005', 'IST-XYZ00004', 'IST-XYZ00004'], 
 'Date' :['9/5/2018 6:00', '9/5/2018 6:01', '9/5/2018 6:59', '9/5/2018 7:04', '9/5/2018 8:54', '28/5/2018 23:40', '28/5/2018 23:59'],
 'ParameterValue': [0.475, 0.375, 0.175, 0.575, 0.115, 0.234, 0.575],
 'Tc': [-18.000000, -16.000000, -14.000000, -12.000000, -11.000000, -29.000000, -28.000000],
 'compressor': [0, 0, 1, 1, 1, 1, 1],
 'Ta': [31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826],
 'label': [0, 0, 0, 1, 0, 0, 0],
 'FaultType_Rcap': [0, 0, 0, 0, 0, 0, 0],
 'FaultType_Rec': [0, 0, 0, 0, 0, 0, 0],
 'FaultType_Ri': [0, 0, 0, 0, 0, 0, 0],
 'FaultType_normal': [1, 1, 1, 1, 1, 1, 1]})

Dummy dataset of per-minute metrics (included so you can replicate the task)

As you can see, I have a 'Date' column that increases by roughly 1 minute per row, although a device may not produce a full 60 rows per hour (1 hour = 60 min). This is the case with device id "358639059721304", which sent only two rows between 23:00 and 00:00.

What I would like to do is group the rows of the DataFrame into 1-hour intervals per device. So, for example, I would like something like the following:

Output

| DeviceID | AssetSerialNo | Date | ParameterValue | Tc | compressor | Ta | label | FaultType_Rcap | FaultType_Rcond | FaultType_Rec | FaultType_Ri | FaultType_normal |
|-----------------|---------------|-----------------|-----------------------|--------------------------------------|------------|-----------------------------------|-------|----------------|-----------------|---------------|--------------|------------------|
| 358639059721158 | BKK-XYZ00005 | 9/5/2018 6:00 | (0.475+0.325+0.525)/3 | (-18.000000+-16.720951+-11.720951)/3 | (0+0+1)/3 | (24.742001+20.742001+35.742001)/3 | 1 | 0 | 0 | 0 | 0 | 1 |
| 358639059721158 | BKK-XYZ00005 | 9/5/2018 7:04 | (0.225+0.105)/2 | (-0.720951+-22.720951)/2 | (1+1)/2 | (5.742001+15.742001)/2 | 1 | 0 | 0 | 0 | 0 | 1 |
| 358639059721304 | IST-XYZ00004 | 28/5/2018 23:40 | (0.127544+0.097544)/2 | (-11.720951+-15.288659)/2 | (0+0)/2 | (31.742001+11.742001)/2 | 1 | 0 | 0 | 0 | 0 | 1 |

So the output groups the data into 1-hour intervals and takes the average of the remaining columns. My goal is to downsample the data: in total, the roughly 1,000,000 rows should shrink to far fewer by grouping the data by 1 hour per device. I would like your thoughts on this using a pandas DataFrame, although I would also like an approach with PySpark. But the pandas solution comes first.

My current approach (not complete)

from datetime import datetime, timedelta

date_time_obj = datetime.strptime(dataframe_classification['Date'].iloc[0], '%d/%m/%Y %H:%M') # create datetime format of "Date" column

[(date_time_obj + timedelta(minutes=x)).strftime("%M") for x in range(60)] # apply some timedelta 60 minutes later
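
For the full column, the same parsing can be done in one vectorized call (just a sketch of the same idea on the dummy dataset `df` above; the format string matches the day/month/year strings):

# Parse the entire 'Date' column at once instead of one value at a time
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M')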

Thank you in advance for any help and advice.

Upvotes: 1

Views: 473

Answers (1)

jamfie

Reputation: 151

You might want to try pd.Grouper:

df['Date'] = pd.to_datetime(df.Date, dayfirst=True)  # dates like '28/5/2018' are day-first
df.groupby(['DeviceID', 'AssetSerialNo', pd.Grouper(key='Date', freq='60min')]).agg('mean')
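
The result has DeviceID, AssetSerialNo and the hourly Date bucket as a MultiIndex; chaining .reset_index() turns them back into ordinary columns if you need a flat frame.

For the PySpark part of the question, a rough sketch of the same aggregation (assuming a running SparkSession, that sdf is a Spark DataFrame with the same columns as the pandas example, and that numeric_cols is just an illustrative helper list):

from pyspark.sql import functions as F

# Parse the day/month/year strings into timestamps
sdf = sdf.withColumn('Date', F.to_timestamp('Date', 'd/M/yyyy H:mm'))

# Columns to average inside each 1-hour bucket (everything except the grouping keys)
numeric_cols = ['ParameterValue', 'Tc', 'compressor', 'Ta', 'label',
                'FaultType_Rcap', 'FaultType_Rec', 'FaultType_Ri', 'FaultType_normal']

# Group by device, asset and a tumbling 1-hour window over 'Date'
hourly = (sdf.groupBy('DeviceID', 'AssetSerialNo', F.window('Date', '1 hour'))
             .agg(*[F.avg(c).alias(c) for c in numeric_cols]))

hourly.show(truncate=False)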

Upvotes: 1
