Josh Friedlander

Reputation: 11657

Efficient way to check dtype of each row in a series

Say I have mixed ts/other data:

import pandas as pd

ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp

I would like to filter for all timestamps. For instance, this gives me what I want:

ser.apply(lambda x: isinstance(x, pd.Timestamp))

0       True
1       True
2       True
3      False
4       True
...
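(For completeness, the resulting boolean mask plugs straight into boolean indexing to get the filtered values; mask is just an illustrative name:)

# build the per-element mask, then keep only the Timestamp rows
mask = ser.apply(lambda x: isinstance(x, pd.Timestamp))
ser[mask]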

But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:

ser.where(isinstance(ser, pd.Timestamp))

But I get

ValueError: Array conditional must be same shape as self

Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?

Upvotes: 4

Views: 388

Answers (2)

jezrael

Reputation: 863166

It depends on the length of the data. For small data (366 rows here), a list comprehension is fastest:

In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But with a larger Series, to_datetime combined with a check for non-missing values via Series.notna is fastest:

ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4

print (len(ser))
14611

In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
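If the end goal is the filtered values rather than the mask itself, any of the variants above can be wrapped in boolean indexing. A minimal sketch using the fastest large-data variant (mask is just an illustrative name):

# rows that fail datetime conversion become NaT, so notna() marks the Timestamps
mask = pd.to_datetime(ser, errors='coerce').notna()
ser[mask]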

Upvotes: 4

cs95

Reputation: 402814

To address your question of filtering, you can convert to datetime and keep only the rows that are not NaT.

ser[pd.to_datetime(ser, errors='coerce').notna()]

Or, if you don't mind the result being datetime,

pd.to_datetime(ser, errors='coerce').dropna()
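Note the trade-off between the two: the first keeps the original Timestamp objects (the result stays object dtype), while the second returns a datetime64[ns] series, since the non-datetime rows were coerced to NaT and then dropped.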

Upvotes: 1
