David

Reputation: 1

Delete rows from a pandas dataframe that have duplicate DatetimeIndex values using several conditions

I have many dataframes with duplicate DatetimeIndex values. The example below shows the dataframe for id = 'D17'. I need to remove all duplicates based on several conditions.

df = pd.DataFrame(
    {"id": ['D17', 'D17', 'D17', 'D17', 'D17', 'D17', 'D17', 'D17', 'D17'],
     "value": [0.82, 0.89, 0, 0.94, 1.02, 0.97, 0.83, 0.83, 0.72]}, index=pd.DatetimeIndex(
        [pd.Timestamp('2012-10-23 00:00:00'), pd.Timestamp('2012-11-01 00:00:00'), pd.Timestamp('2012-11-01 00:00:00'),
         pd.Timestamp('2012-12-03 00:00:00'), pd.Timestamp('2012-12-17 00:00:00'), pd.Timestamp('2013-01-17 00:00:00'),
         pd.Timestamp('2013-01-18 00:00:00'), pd.Timestamp('2013-01-18 00:00:00'), pd.Timestamp('2013-01-24 00:00:00')])
)
print(df)

             id  value
2012-10-23  D17   0.82
2012-11-01  D17   0.89
2012-11-01  D17   0.00
2012-12-03  D17   0.94
2012-12-17  D17   1.02
2013-01-17  D17   0.97
2013-01-18  D17   0.83
2013-01-18  D17   0.83
2013-01-24  D17   0.72

For this id, the duplicate DatetimeIndex values are 2012-11-01 and 2013-01-18.

The two 2013-01-18 rows have the same 'value' of 0.83. In this case, I simply need to drop one of the two rows.

The corresponding 2012-11-01 rows do not have the same 'value'. In this case, I need to keep the row whose value has the smallest absolute deviation from the average of the preceding and following values. Here, this average is (0.82 + 0.94) / 2 = 0.88, and the deviations are abs(0.89 - 0.88) = 0.01 and abs(0 - 0.88) = 0.88. Thus, the row with the value 0 should be dropped.
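In plain Python, the check for 2012-11-01 boils down to this (just the numbers from the example, no pandas involved):

prev_val, next_val = 0.82, 0.94     # neighbouring values on 2012-10-23 and 2012-12-03
avg = (prev_val + next_val) / 2     # 0.88
print(round(abs(0.89 - avg), 2))    # 0.01 -> keep this row
print(round(abs(0.00 - avg), 2))    # 0.88 -> drop this row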

The final output would then look like this:

             id  value
2012-10-23  D17   0.82
2012-11-01  D17   0.89
2012-12-03  D17   0.94
2012-12-17  D17   1.02
2013-01-17  D17   0.97
2013-01-18  D17   0.83
2013-01-24  D17   0.72

My unsuccessful approach so far has been to identify the duplicates via df.index.duplicated(keep=False), loop over all duplicate pairs, and use df.index.get_loc() to find the positions of the rows to drop. I ended up trying to manipulate slice objects, which do not seem to work well with a DatetimeIndex. I am deliberately not posting my code snippets here because, after spending a significant amount of time on them, I believe this approach is very likely not the best one. Any help on how to tackle this issue is highly appreciated.
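For reference, flagging the duplicate timestamps themselves is not the problem; a minimal sketch of that part (the variable name dupes is just illustrative):

dupes = df.index.duplicated(keep=False)  # True for every row whose timestamp occurs more than once
print(df[dupes])                         # the 2012-11-01 and 2013-01-18 rows in the first example

It is the conditional choice of which duplicate to drop that I cannot get right.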

The duplicates can also occur at the beginning and the end of the dataframe like in this example:

df = pd.DataFrame(
    {"id": ['D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18'],
     "value": [0, 0.89, 0.94, 1.02, 0.97, 0.97, 1.05, 0.98, 0.81, 0.83]}, index=pd.DatetimeIndex(
        [pd.Timestamp('2012-11-01 00:00:00'), pd.Timestamp('2012-11-01 00:00:00'),
         pd.Timestamp('2012-12-03 00:00:00'), pd.Timestamp('2012-12-17 00:00:00'),
         pd.Timestamp('2013-01-08 00:00:00'), pd.Timestamp('2013-01-08 00:00:00'),
         pd.Timestamp('2013-01-12 00:00:00'), pd.Timestamp('2013-01-17 00:00:00'),
         pd.Timestamp('2013-01-18 00:00:00'), pd.Timestamp('2013-01-18 00:00:00')])
)
print(df)

            id  value
2012-11-01  D18   0.00
2012-11-01  D18   0.89
2012-12-03  D18   0.94
2012-12-17  D18   1.02
2013-01-08  D18   0.97
2013-01-08  D18   0.97
2013-01-12  D18   1.05
2013-01-17  D18   0.98
2013-01-18  D18   0.81
2013-01-18  D18   0.83

Upvotes: 0

Views: 204

Answers (1)

mozway

Reputation: 261840

First calculate the mean value per timestamp (grouping duplicate index entries), then average the previous and the next timestamp's means. From that, compute each row's absolute deviation. Sort by "deviation", group by index, and keep the first row (= smallest deviation):

# mean value per timestamp (duplicate index entries are grouped together)
means = df['value'].groupby(level=0).mean()
# average of the previous and the next timestamp's mean
means = (means.shift()+means.shift(-1)).div(2)
# absolute deviation of each row's value from that average
df['deviation'] = (df['value']-means).abs()
# per timestamp, keep the row with the smallest deviation
df.sort_values(by='deviation').groupby(level=0).first().sort_index()

output:

             id  value  deviation
2012-10-23  D17   0.82        NaN
2012-11-01  D17   0.89     0.0100
2012-12-03  D17   0.94     0.2075
2012-12-17  D17   1.02     0.0650
2013-01-17  D17   0.97     0.0450
2013-01-18  D17   0.83     0.0150
2013-01-24  D17   0.72        NaN
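If the helper column is not wanted in the result, it can be dropped afterwards. Since the question mentions many ids, here is a minimal sketch of how the same logic could be wrapped in a function and applied per id, assuming all ids live in one dataframe df with an 'id' column as in the examples (the name dedup is just illustrative, not part of the answer above):

import pandas as pd

def dedup(g):
    # mean per timestamp within this id
    means = g['value'].groupby(level=0).mean()
    # average of the previous and the next timestamp's means
    means = (means.shift() + means.shift(-1)).div(2)
    g = g.assign(deviation=(g['value'] - means).abs())
    # keep the row with the smallest deviation per timestamp, drop the helper column
    return (g.sort_values('deviation')
             .groupby(level=0).first()
             .sort_index()
             .drop(columns='deviation'))

out = pd.concat([dedup(g) for _, g in df.groupby('id')])

For timestamps whose neighbours are missing (duplicates at the very start or end of an id, as in the D18 example), the deviation is NaN for the whole group, so this simply keeps the first of the duplicate rows.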

Upvotes: 1
