CptSnuggles
CptSnuggles

Reputation: 137

Apply DatetimeIndex as filter on a Dateframe with datetime values

Ok, so I'm just learning to work with DatetimeIndex and Dateframe objects. I came across a new problem I can't directly see the solution to and I was hoping someone maybe has an elegant solution to it using pandas functions I maybe don't know of yet.

The situation is as follows: On one hand, I have a very huge Dataframe with many rows and several columns, including a column called starttime which has timestamps as values. There might be two or more rows with the same starttime value.

                  starttime                 endtime  ...          y          x
id                                                   ...                      
0    2015-10-11 00:00:55+00  2015-10-11 00:00:55+00  ...          1      other
1    2015-10-11 15:10:42+00  2015-10-11 15:10:42+00  ...          1      other
2    2014-10-21 10:25:44+00  2014-10-21 10:25:44+00  ...          1      other
3    2014-10-21 10:27:28+00  2014-10-21 10:27:28+00  ...          1      other
4    2014-10-21 10:30:27+00  2014-10-21 10:30:27+00  ...          1      other
..                      ...                     ...  ...        ...        ...

On the other hand, I have a DatetimeIndex object, containing a lot of disjoint dates. It is very important to know, that these dates don't form a complete range between date A and B, so there are definitely "holes" in between, so I can't simply apply a date_range.

DatetimeIndex(['2014-12-12', '2014-12-15', '2014-12-16', '2014-12-17',
               '2014-12-18', '2014-12-19', '2014-12-20', '2014-12-21',
               '2015-03-02', '2015-03-03',
               ...],
              dtype='datetime64[ns]', length=xyz, freq=None)

And here goes the question: What I need now is to drop all rows of the Dataframe inplace which starttime value is not represented with a date in the DatetimeIndex. The time in h:m:s is irrelevant, so if I have a date "2014-12-12" and two rows "2014-12-12 00:00:55+00" and "2014-12-12 15:10:42+00" both should be included. The resulting pruned Dataframe should also still contain all columns it had before.

My first iterative approach was to take one date of the DatetimeIndex after the other and go through all rows of the Dataframe and copy out the row if it's on the same day into a new Frame, but I thought there must be a better way for this, since I'm obviously getting into serious performance issues if the Dataframe has too many rows.

Upvotes: 1

Views: 1021

Answers (1)

jezrael
jezrael

Reputation: 862406

Remove times by Series.dt.floor, compare by Series.isin and filter by boolean indexing:

#some value for match
idx = pd.DatetimeIndex(['2015-03-02', '2015-10-11'])

df['starttime'] = pd.to_datetime(df['starttime'])

df1 = df[df['starttime'].dt.floor('D').isin(idx)]
print (df1)
   id                 starttime                 endtime  y      x
0   0 2015-10-11 00:00:55+00:00  2015-10-11 00:00:55+00  1  other
1   1 2015-10-11 15:10:42+00:00  2015-10-11 15:10:42+00  1  other

Details:

print (df['starttime'].dt.floor('D'))
0   2015-10-11 00:00:00+00:00
1   2015-10-11 00:00:00+00:00
2   2014-10-21 00:00:00+00:00
3   2014-10-21 00:00:00+00:00
4   2014-10-21 00:00:00+00:00
Name: starttime, dtype: datetime64[ns, UTC]

print (df['starttime'].dt.floor('D').isin(idx))
0     True
1     True
2    False
3    False
4    False
Name: starttime, dtype: bool

Upvotes: 1

Related Questions