Shu Pan

Reputation: 305

find non-monotonic rows in a dataframe

I have a pandas DataFrame with a Datetime index. The index is generally monotonically increasing; however, there seem to be a few rows that don't follow this trend. Is there a quick way to identify these unusual rows?

Upvotes: 4

Views: 2332

Answers (2)

MaxU - stand with Ukraine

Reputation: 210882

Consider the following demo:

In [156]: df
Out[156]:
                 val
2017-01-01  0.889887
2017-01-02  0.838433
2017-01-03  0.977659
2017-01-04  0.750143
2017-01-05  0.271435
1970-01-01  0.138332    # <---- !!!
2017-01-07  0.673203
2017-01-08  0.497589
1999-01-01  0.592959    # <---- !!!
2017-01-10  0.818760

In [157]: df.loc[df.index.to_series().diff() < pd.to_timedelta('0 seconds')]
Out[157]:
                 val
1970-01-01  0.138332
1999-01-01  0.592959

In [158]: df.index.to_series().diff() < pd.to_timedelta('0 seconds')
Out[158]:
2017-01-01    False
2017-01-02    False
2017-01-03    False
2017-01-04    False
2017-01-05    False
1970-01-01     True
2017-01-07    False
2017-01-08    False
1999-01-01     True
2017-01-10    False
dtype: bool

In [159]: df.index.to_series().diff()
Out[159]:
2017-01-01           NaT
2017-01-02        1 days
2017-01-03        1 days
2017-01-04        1 days
2017-01-05        1 days
1970-01-01   -17171 days
2017-01-07    17173 days
2017-01-08        1 days
1999-01-01    -6582 days
2017-01-10     6584 days
dtype: timedelta64[ns]
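
A small self-contained sketch of the same check, in case you want to reproduce the demo (the val numbers here are just random placeholders):

import numpy as np
import pandas as pd

# Mostly increasing DatetimeIndex with two out-of-order entries (1970 and 1999)
idx = pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                      '2017-01-05', '1970-01-01', '2017-01-07', '2017-01-08',
                      '1999-01-01', '2017-01-10'])
df = pd.DataFrame({'val': np.random.rand(len(idx))}, index=idx)

# Rows where the index steps backwards relative to the previous row
mask = df.index.to_series().diff() < pd.to_timedelta('0 seconds')
print(df.loc[mask])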

Upvotes: 7

Prune

Reputation: 77850

"Quick" in terms of what resource? If you want programming ease, then simply make a new frame resulting from subtracting adjacent columns. Any entry of zero or negative value is your target.

If you need execution speed, note that the adjacent differences are still necessary: all you can save is the overhead of finding multiple violations in a given row. However, unless you have a particularly wide data frame, you'll likely lose more to short-circuiting than you'll gain from the saved subtractions. Also note that a processor with vector/matrix operations or other parallelism will handle the whole data frame quickly enough that the extra checking would cost you more time than it saves.
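
If you just want a cheap way to decide whether the scan is worth running at all, pandas also exposes a vectorized monotonicity flag on the index (reusing the df sketched above):

# Whole-index, vectorized check; only compute per-row diffs when it fails
if not df.index.is_monotonic_increasing:
    deltas = df.index.to_series().diff()
    print(df.loc[deltas <= pd.Timedelta(0)])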

Upvotes: 0
