some_programmer

Reputation: 3528

How to delete certain values in a column after Boolean indexing?

I have a df as follows:

dates   values
2020-03-29 00:30:00 86.824
2020-03-29 00:45:00 86.923
2020-03-29 01:00:00 87.222
2020-03-29 01:15:00 87.52
2020-03-29 01:30:00 87.918
2020-03-29 01:45:00 88.415
2020-03-29 02:00:00 89.012
2020-03-29 02:15:00 89.807
2020-03-29 02:30:00 90.504
2020-03-29 02:45:00 90.702
2020-03-29 03:00:00 90.205
2020-03-29 03:15:00 88.713
2020-03-29 03:30:00 86.724
2020-03-29 03:45:00 84.636
2020-03-29 04:00:00 83.044
2020-03-29 04:15:00 82.249

I need to account for daylight saving time. To do that, I have to delete only the affected timestamps from the dates column; the entries in the values column must not be deleted.

Since daylight saving time starts on the last Sunday of March every year, I did the following to obtain the date of the last Sunday in March:

import datetime

# 31 March of the current year
currentYear = datetime.datetime.now().strftime("%Y")
march_last_day = currentYear + '-03-31'
march_last_day = datetime.datetime.strptime(march_last_day, '%Y-%m-%d')

# Step back to the last Sunday of March (weekday(): Monday=0 ... Sunday=6)
ZEIT_UMSTELLUNG_START = march_last_day - \
    datetime.timedelta((march_last_day.weekday() + 1) % 7)
# The clock change (Zeitumstellung) skips the hour from 02:00 to 03:00
ZEIT_UMSTELLUNG_START = ZEIT_UMSTELLUNG_START.replace(hour=2, minute=0)
ZEIT_UMSTELLUNG_START_2 = ZEIT_UMSTELLUNG_START + datetime.timedelta(hours=1)

The resulting values:

ZEIT_UMSTELLUNG_START.replace(hour=2, minute=00)
Out[13]: datetime.datetime(2020, 3, 29, 2, 0)
ZEIT_UMSTELLUNG_START + datetime.timedelta(hours=1)
Out[14]: datetime.datetime(2020, 3, 29, 3, 0)
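For reference, the same last-Sunday calculation can be wrapped in a small helper that takes the year as a parameter, which makes it easy to check against known dates. This is only a sketch; the function name last_sunday_of_march is hypothetical and not part of the original code.

```python
import datetime


def last_sunday_of_march(year: int) -> datetime.datetime:
    """Return midnight on the last Sunday of March for the given year."""
    march_last_day = datetime.datetime(year, 3, 31)
    # weekday(): Monday=0 ... Sunday=6, so (weekday + 1) % 7 is the number
    # of days to step back from 31 March to reach a Sunday.
    days_back = (march_last_day.weekday() + 1) % 7
    return march_last_day - datetime.timedelta(days=days_back)


# Same window as in the question: 02:00 (inclusive) to 03:00 (exclusive).
start = last_sunday_of_march(2020).replace(hour=2, minute=0)
end = start + datetime.timedelta(hours=1)
print(start, end)
```

Passing the year explicitly instead of reading it from datetime.now() also means the window can be computed for data from earlier years.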

Now I want to delete the timestamps that lie between ZEIT_UMSTELLUNG_START (inclusive) and ZEIT_UMSTELLUNG_START_2 (exclusive) from the dates column and shift the following timestamps up, while leaving the values column untouched.

I did:

mask = (df['dates'] >= ZEIT_UMSTELLUNG_START) & (df['dates'] < ZEIT_UMSTELLUNG_START_2)
df = df.loc[mask]

But this filters entire rows, so the values column is affected as well.

Desired output

dates   values
2020-03-29 00:30:00 86.824
2020-03-29 00:45:00 86.923
2020-03-29 01:00:00 87.222
2020-03-29 01:15:00 87.52
2020-03-29 01:30:00 87.918
2020-03-29 01:45:00 88.415
2020-03-29 03:00:00 89.012
2020-03-29 03:15:00 89.807
2020-03-29 03:30:00 90.504
2020-03-29 03:45:00 90.702
2020-03-29 04:00:00 90.205
2020-03-29 04:15:00 88.713
         NaT        86.724
         NaT        84.636
         NaT        83.044
         NaT        82.249

If you compare my desired output with the original df, you can see that the timestamps from 02:00:00 up to (but not including) 03:00:00 have been deleted and the remaining timestamps have been shifted up. After shifting, the last 4 rows of the dates column can be filled with NaT.

Upvotes: 1

Views: 38

Answers (1)

Vishnudev Krishnadas

Reputation: 10960

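Try setting the masked timestamps to NaT and then re-sorting the dates column (with pandas imported as pd). sort_values places missing values last by default, so the remaining timestamps move up and the NaT entries end up in the last rows. This assumes the dates were in ascending order to begin with:

# Blank out the timestamps inside the skipped hour; values stay untouched.
df.loc[mask, 'dates'] = pd.NaT
# Sort the dates (NaT goes last) and write them back by position.
df['dates'] = df['dates'].sort_values(ascending=True).tolist()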
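As an alternative to sorting, here is a minimal, self-contained sketch that keeps the timestamps outside the mask in their existing order and pads the end of the column with NaT. It rebuilds the sample frame from the question; the literal start and end timestamps stand in for ZEIT_UMSTELLUNG_START and ZEIT_UMSTELLUNG_START_2, and the variable name kept is made up for illustration.

```python
import pandas as pd

# Sample data from the question: 15-minute steps from 00:30 to 04:15.
df = pd.DataFrame({
    "dates": pd.date_range("2020-03-29 00:30", periods=16, freq="15min"),
    "values": [86.824, 86.923, 87.222, 87.52, 87.918, 88.415, 89.012, 89.807,
               90.504, 90.702, 90.205, 88.713, 86.724, 84.636, 83.044, 82.249],
})

# Stand-ins for ZEIT_UMSTELLUNG_START and ZEIT_UMSTELLUNG_START_2.
start = pd.Timestamp("2020-03-29 02:00")
end = pd.Timestamp("2020-03-29 03:00")
mask = (df["dates"] >= start) & (df["dates"] < end)

# Keep only the timestamps outside the skipped hour, renumbered from 0.
kept = df.loc[~mask, "dates"].reset_index(drop=True)
# Reindex to the full length so the missing tail positions become NaT,
# then assign by position so the values column is left as it was.
df["dates"] = kept.reindex(range(len(df))).to_numpy()

print(df)
```

Unlike the sort-based approach, this does not depend on the dates column being in ascending order, since the surviving timestamps are simply moved up in the order they already had.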

Upvotes: 1
