Reputation: 137
Ok, so I'm just learning to work with DatetimeIndex and Dateframe objects. I came across a new problem I can't directly see the solution to and I was hoping someone maybe has an elegant solution to it using pandas functions I maybe don't know of yet.
The situation is as follows: On one hand, I have a very huge Dataframe with many rows and several columns, including a column called starttime which has timestamps as values. There might be two or more rows with the same starttime value.
starttime endtime ... y x
id ...
0 2015-10-11 00:00:55+00 2015-10-11 00:00:55+00 ... 1 other
1 2015-10-11 15:10:42+00 2015-10-11 15:10:42+00 ... 1 other
2 2014-10-21 10:25:44+00 2014-10-21 10:25:44+00 ... 1 other
3 2014-10-21 10:27:28+00 2014-10-21 10:27:28+00 ... 1 other
4 2014-10-21 10:30:27+00 2014-10-21 10:30:27+00 ... 1 other
.. ... ... ... ... ...
On the other hand, I have a DatetimeIndex object, containing a lot of disjoint dates. It is very important to know, that these dates don't form a complete range between date A and B, so there are definitely "holes" in between, so I can't simply apply a date_range.
DatetimeIndex(['2014-12-12', '2014-12-15', '2014-12-16', '2014-12-17',
'2014-12-18', '2014-12-19', '2014-12-20', '2014-12-21',
'2015-03-02', '2015-03-03',
...],
dtype='datetime64[ns]', length=xyz, freq=None)
And here goes the question: What I need now is to drop all rows of the Dataframe inplace which starttime value is not represented with a date in the DatetimeIndex. The time in h:m:s is irrelevant, so if I have a date "2014-12-12" and two rows "2014-12-12 00:00:55+00" and "2014-12-12 15:10:42+00" both should be included. The resulting pruned Dataframe should also still contain all columns it had before.
My first iterative approach was to take one date of the DatetimeIndex after the other and go through all rows of the Dataframe and copy out the row if it's on the same day into a new Frame, but I thought there must be a better way for this, since I'm obviously getting into serious performance issues if the Dataframe has too many rows.
Upvotes: 1
Views: 1021
Reputation: 862406
Remove times by Series.dt.floor
, compare by Series.isin
and filter by boolean indexing
:
#some value for match
idx = pd.DatetimeIndex(['2015-03-02', '2015-10-11'])
df['starttime'] = pd.to_datetime(df['starttime'])
df1 = df[df['starttime'].dt.floor('D').isin(idx)]
print (df1)
id starttime endtime y x
0 0 2015-10-11 00:00:55+00:00 2015-10-11 00:00:55+00 1 other
1 1 2015-10-11 15:10:42+00:00 2015-10-11 15:10:42+00 1 other
Details:
print (df['starttime'].dt.floor('D'))
0 2015-10-11 00:00:00+00:00
1 2015-10-11 00:00:00+00:00
2 2014-10-21 00:00:00+00:00
3 2014-10-21 00:00:00+00:00
4 2014-10-21 00:00:00+00:00
Name: starttime, dtype: datetime64[ns, UTC]
print (df['starttime'].dt.floor('D').isin(idx))
0 True
1 True
2 False
3 False
4 False
Name: starttime, dtype: bool
Upvotes: 1