Reputation: 355
Sorry for the confusing title. I'm trying to clean up a dataset that has engine hours reported on different time intervals. I'm trying to detect and address two situations:
Engine hours reported are less than last records engine hours
Engine hours reported between two dates is greater than the hour difference between said dates
Sampled Date Meter eng_hours_diff date_hours_diff
2017-02-02 5336 24 24
2017-02-20 5578 242 432
2017-02-22 5625 47 48
2017-03-07 5930 305 312
2017-05-16 6968 1038 1680
2017-06-01 7182 214 384
2017-06-22 7527 345 504
2017-07-10 7919 392 432
2017-07-25 16391 8472 360
2017-08-20 8590 -7801 624
2017-09-05 8827 237 384
2017-09-26 9106 279 504
2017-10-16 9406 300 480
2017-10-28 9660 254 288
2017-11-29 10175 515 768
What I would like to do is re-write the ['Meter'] series if either of the two scenarios above come up and take the average between the points around it.
I'm thinking that this might require two steps, one to eliminate any inaccuracy due to the difference in engine hours being > than the hours between the dates, and then re-calculate the ['eng_hours_diff'] column and check if there are still any that are negative.
The last two columns I've calculated like this:
dfa['eng_hours_diff'] = dfa['Meter'].diff().fillna(24)
dfa['date_hours_diff'] = dfa['Sampled Date'].diff().apply(lambda x:str(x)).apply(lambda x: x.split(' ')[0]).apply(lambda x:x.replace('NaT',"1")).apply(lambda x: int(x)*24)
EDIT: Ok I think I'm getting somewhere but not quite there yet..
dfa['MeterReading'] = [x if y>0 & y<z else 0 for x,y,z in
zip(dfa['Meter'],dfa['eng_hours_diff'], dfa['date_hours_diff'])]
EDIT 2:
I'm much closer thanks to Bill's answer.
Applying this function will replace any record that doesn't meet the criteria with a zero. Then I'm replacing those zeros with np.nan and using the interpolate method.
The only thing that I'm missing is how to fill out the last values when they also come as np.nan, I'm looking to see if there's an extrapolate method.
Here is the function in case anyone stumbles upon a similar problem in the future:
dfa['MeterReading'] = dfa['MeterReading'].replace({0:np.nan}).interpolate(method='polynomial', order=2, limit=5, limit_direction='both').bfill()
This is the issue that I'm having at the end. Two values were missed but since the difference becomes negative it discards all 4.
Upvotes: 0
Views: 71
Reputation: 11656
One problem with your code is that the logic condition is not doing what you want I think. y>0 & y<z
is not the same as (y>0) & (y<z)
(e.g. for the first row).
Putting that aside, there are in general three ways to do operations on the elements of rows in a pandas Dataframe.
For simple cases like yours where the operations are vectorizable you can do them without a for loop or list comprehension:
dfa['MeterReading'] = dfa['Meter']
condition = (dfa['eng_hours_diff'] > 0) & (dfa['eng_hours_diff'] < dfa['date_hours_diff'])
dfa.loc[~condition, 'MeterReading'] = 0
For more complex logic, you can use a for loop like this:
dfa['MeterReading'] = 0
for i, row in dfa.iterrows():
if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
dfa.loc[i, 'MeterReading'] = row['Meter']
Or, use apply
with a custom function like:
def calc_meter_reading(row):
if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
return row['Meter']
else:
return 0
dfa['MeterReading'] = dfa.apply(calc_meter_reading, axis=1)
Upvotes: 1