ortunoa
ortunoa

Reputation: 355

How to clean up or smoothen a time series using two criteria in Pandas

Sorry for the confusing title. I'm trying to clean up a dataset that has engine hours reported on different time intervals. I'm trying to detect and address two situations:

  1. Engine hours reported are less than last records engine hours

  2. Engine hours reported between two dates is greater than the hour difference between said dates

    Sampled Date    Meter   eng_hours_diff  date_hours_diff
     2017-02-02     5336         24                24
     2017-02-20     5578         242               432
     2017-02-22     5625         47                48
     2017-03-07     5930         305               312
     2017-05-16     6968         1038              1680
     2017-06-01     7182         214               384
     2017-06-22     7527         345               504
     2017-07-10     7919         392               432
     2017-07-25     16391        8472              360
     2017-08-20     8590        -7801              624
     2017-09-05     8827         237               384
     2017-09-26     9106         279               504
     2017-10-16     9406         300               480
     2017-10-28     9660         254               288
     2017-11-29     10175        515               768
    

What I would like to do is re-write the ['Meter'] series if either of the two scenarios above come up and take the average between the points around it.

I'm thinking that this might require two steps, one to eliminate any inaccuracy due to the difference in engine hours being > than the hours between the dates, and then re-calculate the ['eng_hours_diff'] column and check if there are still any that are negative.

The last two columns I've calculated like this:

dfa['eng_hours_diff'] = dfa['Meter'].diff().fillna(24)

dfa['date_hours_diff'] = dfa['Sampled Date'].diff().apply(lambda x:str(x)).apply(lambda x: x.split(' ')[0]).apply(lambda x:x.replace('NaT',"1")).apply(lambda x: int(x)*24)
    

EDIT: Ok I think I'm getting somewhere but not quite there yet..

dfa['MeterReading'] = [x if y>0 & y<z else 0 for x,y,z in 
                       zip(dfa['Meter'],dfa['eng_hours_diff'], dfa['date_hours_diff'])]

EDIT 2:

I'm much closer thanks to Bill's answer.

Applying this function will replace any record that doesn't meet the criteria with a zero. Then I'm replacing those zeros with np.nan and using the interpolate method.

The only thing that I'm missing is how to fill out the last values when they also come as np.nan, I'm looking to see if there's an extrapolate method.

Here is the function in case anyone stumbles upon a similar problem in the future:

dfa['MeterReading'] = dfa['MeterReading'].replace({0:np.nan}).interpolate(method='polynomial', order=2, limit=5, limit_direction='both').bfill()

This is the issue that I'm having at the end. Two values were missed but since the difference becomes negative it discards all 4.

enter image description here

Upvotes: 0

Views: 71

Answers (1)

Bill
Bill

Reputation: 11656

One problem with your code is that the logic condition is not doing what you want I think. y>0 & y<z is not the same as (y>0) & (y<z) (e.g. for the first row).

Putting that aside, there are in general three ways to do operations on the elements of rows in a pandas Dataframe.

For simple cases like yours where the operations are vectorizable you can do them without a for loop or list comprehension:

dfa['MeterReading'] = dfa['Meter']
condition = (dfa['eng_hours_diff'] > 0) & (dfa['eng_hours_diff'] < dfa['date_hours_diff'])
dfa.loc[~condition, 'MeterReading'] = 0

For more complex logic, you can use a for loop like this:

dfa['MeterReading'] = 0
for i, row in dfa.iterrows():
    if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
        dfa.loc[i, 'MeterReading'] = row['Meter']

Or, use apply with a custom function like:

def calc_meter_reading(row):
    if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
        return row['Meter']
    else:
        return 0

dfa['MeterReading'] = dfa.apply(calc_meter_reading, axis=1)

Upvotes: 1

Related Questions