Reputation: 305
I have a pandas DataFrame with a Datetime index. The index is generally monotonically increasing, however there seem to be a few rows that don't follow this trend. Any quick way to identify these unusual rows?
Upvotes: 4
Views: 2332
Reputation: 210882
Consider the following demo:
In [156]: df
Out[156]:
val
2017-01-01 0.889887
2017-01-02 0.838433
2017-01-03 0.977659
2017-01-04 0.750143
2017-01-05 0.271435
1970-01-01 0.138332 # <---- !!!
2017-01-07 0.673203
2017-01-08 0.497589
1999-01-01 0.592959 # <---- !!!
2017-01-10 0.818760
In [157]: df.loc[df.index.to_series().diff() < pd.to_timedelta('0 seconds')]
Out[157]:
val
1970-01-01 0.138332
1999-01-01 0.592959
In [158]: df.index.to_series().diff() < pd.to_timedelta('0 seconds')
Out[158]:
2017-01-01 False
2017-01-02 False
2017-01-03 False
2017-01-04 False
2017-01-05 False
1970-01-01 True
2017-01-07 False
2017-01-08 False
1999-01-01 True
2017-01-10 False
dtype: bool
In [159]: df.index.to_series().diff()
Out[159]:
2017-01-01 NaT
2017-01-02 1 days
2017-01-03 1 days
2017-01-04 1 days
2017-01-05 1 days
1970-01-01 -17171 days
2017-01-07 17173 days
2017-01-08 1 days
1999-01-01 -6582 days
2017-01-10 6584 days
dtype: timedelta64[ns]
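For reference, here is a self-contained script version of the same idea, a minimal sketch assuming pandas and NumPy are available; the column name val and the sample dates simply mirror the demo above:

import numpy as np
import pandas as pd

# Sample frame whose DatetimeIndex is mostly, but not entirely, increasing.
idx = pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03',
                      '2017-01-04', '2017-01-05', '1970-01-01',
                      '2017-01-07', '2017-01-08', '1999-01-01',
                      '2017-01-10'])
df = pd.DataFrame({'val': np.random.rand(len(idx))}, index=idx)

# diff() gives the timedelta between each index value and the previous one;
# a negative timedelta means the index went backwards at that row.
mask = df.index.to_series().diff() < pd.Timedelta(0)
print(df[mask])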
Upvotes: 7
Reputation: 77850
"Quick" in terms of what resource? If you want programming ease, then simply make a new frame resulting from subtracting adjacent columns. Any entry of zero or negative value is your target.
If you need execution speed, note that the adjacent differences are still necessary: all you can save is the overhead of checking the remaining rows once a violation has been found. However, unless you have a particularly long data frame, you'll likely lose more to the short-circuiting logic than you'll gain from the saved subtractions. Also note that a processor with vector or matrix operations, or other parallelism, will process the whole data frame quickly enough that element-by-element checking would cost you more time than it saves.
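A minimal sketch of that idea, assuming df is the datetime-indexed frame from the question (the variable names here are illustrative, not from the original post):

import numpy as np

# Adjacent differences of the index values (datetime64), giving timedelta64 steps.
deltas = np.diff(df.index.values)

# A zero or negative step violates monotonic increase; np.diff is one element
# shorter than the index, so shift by 1 to point at the offending row.
bad_positions = np.flatnonzero(deltas <= np.timedelta64(0, 'ns')) + 1
print(df.iloc[bad_positions])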
Upvotes: 0