David

Reputation: 1

Delete rows from a pandas dataframe that have duplicate DatetimeIndex values using several conditions

I have many dataframes with duplicate DatetimeIndex values. The example below shows the dataframe for id = 'D17'. I need to remove all duplicates based on several conditions.

df = pd.DataFrame(
    {"id": ['D17', 'D17', 'D17', 'D17', 'D17', 'D17', 'D17', 'D17', 'D17'],
     "value": [0.82, 0.89, 0, 0.94, 1.02, 0.97, 0.83, 0.83, 0.72]}, index=pd.DatetimeIndex(
        [pd.Timestamp('2012-10-23 00:00:00'), pd.Timestamp('2012-11-01 00:00:00'), pd.Timestamp('2012-11-01 00:00:00'),
         pd.Timestamp('2012-12-03 00:00:00'), pd.Timestamp('2012-12-17 00:00:00'), pd.Timestamp('2013-01-17 00:00:00'),
         pd.Timestamp('2013-01-18 00:00:00'), pd.Timestamp('2013-01-18 00:00:00'), pd.Timestamp('2013-01-24 00:00:00')])
)
print(df)

             id  value
2012-10-23  D17   0.82
2012-11-01  D17   0.89
2012-11-01  D17   0.00
2012-12-03  D17   0.94
2012-12-17  D17   1.02
2013-01-17  D17   0.97
2013-01-18  D17   0.83
2013-01-18  D17   0.83
2013-01-24  D17   0.72

For this id, the duplicate DatetimeIndex values are 2012-11-01 and 2013-01-18.

The two 2013-01-18 rows have the same 'value' of 0.83. In this case, I simply need to drop one of the two rows.

The corresponding 2012-11-01 rows do not have the same 'value'. In this case, I need to keep the row whose value has the smallest absolute deviation from the average of the preceding and following values. Here, this average is (0.82 + 0.94) / 2 = 0.88, and the deviations are abs(0.89 - 0.88) = 0.01 and abs(0 - 0.88) = 0.88. Thus, the row with the value 0 should be dropped.
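In plain Python, the check for 2012-11-01 boils down to this (just the numbers from the example, no pandas involved):

prev_val, next_val = 0.82, 0.94     # neighbouring values on 2012-10-23 and 2012-12-03
avg = (prev_val + next_val) / 2     # 0.88
print(round(abs(0.89 - avg), 2))    # 0.01 -> keep this row
print(round(abs(0.00 - avg), 2))    # 0.88 -> drop this row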

The final output would then look like this:

             id  value
2012-10-23  D17   0.82
2012-11-01  D17   0.89
2012-12-03  D17   0.94
2012-12-17  D17   1.02
2013-01-17  D17   0.97
2013-01-18  D17   0.83
2013-01-24  D17   0.72

My unsuccessful approach so far has been to identify the duplicates via df.index.duplicated(keep=False), loop over all duplicate pairs, and use df.index.get_loc() to find the positions of the rows to drop. I ended up trying to manipulate slice objects, which do not seem to work well with a DatetimeIndex. I am deliberately not posting my code snippets here because, after spending a significant amount of time on them, I believe this approach is very likely not the best one. Any help on how to tackle this issue is highly appreciated.
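For reference, flagging the duplicate timestamps themselves is not the problem; a minimal sketch of that part (the variable name dupes is just illustrative):

dupes = df.index.duplicated(keep=False)  # True for every row whose timestamp occurs more than once
print(df[dupes])                         # the 2012-11-01 and 2013-01-18 rows in the first example

It is the conditional choice of which duplicate to drop that I cannot get right.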

The duplicates can also occur at the beginning and the end of the dataframe like in this example:

df = pd.DataFrame(
    {"id": ['D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18', 'D18'],
     "value": [0, 0.89, 0.94, 1.02, 0.97, 0.97, 1.05, 0.98, 0.81, 0.83]}, index=pd.DatetimeIndex(
        [pd.Timestamp('2012-11-01 00:00:00'), pd.Timestamp('2012-11-01 00:00:00'),
         pd.Timestamp('2012-12-03 00:00:00'), pd.Timestamp('2012-12-17 00:00:00'),
         pd.Timestamp('2013-01-08 00:00:00'), pd.Timestamp('2013-01-08 00:00:00'),
         pd.Timestamp('2013-01-12 00:00:00'), pd.Timestamp('2013-01-17 00:00:00'),
         pd.Timestamp('2013-01-18 00:00:00'), pd.Timestamp('2013-01-18 00:00:00')])
)
print(df)

            id  value
2012-11-01  D18   0.00
2012-11-01  D18   0.89
2012-12-03  D18   0.94
2012-12-17  D18   1.02
2013-01-08  D18   0.97
2013-01-08  D18   0.97
2013-01-12  D18   1.05
2013-01-17  D18   0.98
2013-01-18  D18   0.81
2013-01-18  D18   0.83

Upvotes: 0

Views: 204

Answers (1)

mozway

Reputation: 261840

First calculate the mean value per timestamp (grouping duplicate index entries), then average the previous and the next timestamp's means. From that, compute each row's absolute deviation. Sort by "deviation", group by index, and keep the first row (= smallest deviation):

# mean value per timestamp (duplicate index entries are grouped together)
means = df['value'].groupby(level=0).mean()
# average of the previous and the next timestamp's mean
means = (means.shift()+means.shift(-1)).div(2)
# absolute deviation of each row's value from that average
df['deviation'] = (df['value']-means).abs()
# per timestamp, keep the row with the smallest deviation
df.sort_values(by='deviation').groupby(level=0).first().sort_index()

output:

             id  value  deviation
2012-10-23  D17   0.82        NaN
2012-11-01  D17   0.89     0.0100
2012-12-03  D17   0.94     0.2075
2012-12-17  D17   1.02     0.0650
2013-01-17  D17   0.97     0.0450
2013-01-18  D17   0.83     0.0150
2013-01-24  D17   0.72        NaN
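If the helper column is not wanted in the result, it can be dropped afterwards. Since the question mentions many ids, here is a minimal sketch of how the same logic could be wrapped in a function and applied per id, assuming all ids live in one dataframe df with an 'id' column as in the examples (the name dedup is just illustrative, not part of the answer above):

import pandas as pd

def dedup(g):
    # mean per timestamp within this id
    means = g['value'].groupby(level=0).mean()
    # average of the previous and the next timestamp's means
    means = (means.shift() + means.shift(-1)).div(2)
    g = g.assign(deviation=(g['value'] - means).abs())
    # keep the row with the smallest deviation per timestamp, drop the helper column
    return (g.sort_values('deviation')
             .groupby(level=0).first()
             .sort_index()
             .drop(columns='deviation'))

out = pd.concat([dedup(g) for _, g in df.groupby('id')])

For timestamps whose neighbours are missing (duplicates at the very start or end of an id, as in the D18 example), the deviation is NaN for the whole group, so this simply keeps the first of the duplicate rows.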

Upvotes: 1
