YannickAaron

Reputation: 167

Efficient way to loop through GroupBy DataFrame

Since my last post lacked some information, here are the details:

Example of my df (the important columns):

- device_id: unique ID of the vehicle. Vehicles send data every X minutes.
- mileage: the distance moved since the last message (in km).
- position_timestamp_measure: Unix timestamp of when the dataset was created.

device_id  mileage  position_timestamp_measure
54672      10       1600696079
43423      20       1600696079
42342      3        1600701501
54672      3        1600702102
43423      2        1600702701
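
For reference, the sample above can be rebuilt as a small DataFrame (a minimal sketch; column names match the code below):

import pandas as pd

df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079,
                                   1600701501, 1600702102, 1600702701],
})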

My goal is to validate the mileage by comparing it to the maximum speed of the vehicle (which is 80 km/h): I calculate the vehicle's speed from the timestamp difference and the mileage. The result should then be written back into the original dataset.
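
For example, the two messages of device 54672 are 6023 seconds apart, so the implied speed is far below the limit. A quick sketch of that check (with maxSpeedKMH assumed to be 80, per the stated maximum):

maxSpeedKMH = 80
timeGoneSec = 1600702102 - 1600696079   # 6023 s between the two messages
timeHours = timeGoneSec / 3600          # ~1.67 h
speed = 3 / timeHours                   # ~1.79 km/h
print(speed < maxSpeedKMH)              # True -> the point is valid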

What I've done so far is the following:

maxSpeedKMH = 80   # maximum speed of the vehicles in km/h

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and mark all rows as invalid (0) for now
df_ori['valid'] = 0

for group_name, group in df:

    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()

    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    
    #iterate through each data in the group
    
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i]-group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        # calculate the speed and mark the point as valid if it is plausible
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1
            
            

df_ori['valid'].value_counts()

It definitely works the way I want it to, but its performance is very poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.

Upvotes: 2

Views: 1147

Answers (1)

Azuuu

Reputation: 894

If I got it right, no for-loops are needed here. Here is what I've transformed your code into:

# create new col and mark all rows as invalid (0) for now
df_ori['valid'] = 0

df_ori = df_ori.sort_values(['position_timestamp_measure'])

# Subtract the preceding value from the current value within each group
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].diff()


# The operation above produces NaN for the first value in each group;
# fill 'valid' with 1 for those rows, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1

df_ori['timeHours'] = df_ori['timeGoneSec']/3600   # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) < maxSpeedKMH

df_ori.loc[df_ori['flag'], 'valid'] = 1

# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])

The basic idea is to use vectorized operations as much as possible and to avoid for-loops, i.e. row-by-row iteration, which can be insanely slow.
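
As a minimal sketch (with made-up data) of the pattern that replaces the inner loop: a single groupby().diff() call computes all per-group differences at once, with NaN marking the first row of each group.

import pandas as pd

demo = pd.DataFrame({
    'device_id': [1, 1, 2, 2],
    't': [10, 25, 5, 65],
})
# diff() within each group; the first row of each group yields NaN
demo['dt'] = demo.groupby('device_id')['t'].diff()
print(demo)
# expected output (roughly):
#    device_id   t    dt
# 0          1  10   NaN
# 1          1  25  15.0
# 2          2   5   NaN
# 3          2  65  60.0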

Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.

Upvotes: 1
