maynull

Reputation: 2046

How to extract rows from a numpy array that meet several conditions?

I want to extract rows that meet several conditions from another array.

This is what the original array looks like:

original = array([[Timestamp('2018-01-15 01:59:00'), 329, 30, 5],
                  [Timestamp('2018-01-15 01:59:00'), 326, 25, 3],
                  [Timestamp('2018-01-15 02:00:00'), 324, 22, 34],
                  ..., 
                  [Timestamp('2018-01-15 21:57:00'), 322, 23, 3],
                  [Timestamp('2018-01-15 21:57:00'), 323, 30, 9],
                  [Timestamp('2018-01-15 21:59:00'), 323, 1, 19]], dtype=object)

The conditions are:
1) Either the 3rd or 4th value is bigger than 25.
2) Either the 3rd or 4th value is more than twice the other value.
3) The row was received between 01:00 and 06:00.

So, according to the conditions, the first row will be extracted (30 is bigger than 25, 30 is more than twice 5, and the row was made at 01:59:00, which is between 01:00 and 06:00).

Is it possible to do this only with np.where?

Edit: I could do the job with pandas.

>>> df_text = pd.DataFrame( trade_reset , columns=['date', 'freq', 'in', 'out'])

>>> df_text = df_text[(df_text['in'] >= 30 ) | (df_text['out'] >= 30 )]

>>> df_text = df_text[(df_text['in'] > df_text['out']*2 ) | (df_text['out'] >= df_text['in']*2 )]

>>> df_text[ (df_text['date'] < datetime(2018, 1, 15, 6)) & (df_text['date'] > datetime(2018, 1, 15, 1)) ]

Upvotes: 1

Views: 195

Answers (1)

hpaulj

Reputation: 231355

For convenience, define Timestamp as a np.datetime64 creator:

In [492]: Timestamp=lambda x: np.datetime64(x, 's')
In [493]: Timestamp('2018-01-15 01:59:00')
Out[493]: numpy.datetime64('2018-01-15T01:59:00')
In [494]: original = np.array([[Timestamp('2018-01-15 01:59:00'), 329, 30, 5],
     ...:                   [Timestamp('2018-01-15 01:59:00'), 326, 25, 3],
     ...:                   [Timestamp('2018-01-15 02:00:00'), 324, 22, 34],
     ...:                   [Timestamp('2018-01-15 21:57:00'), 322, 23, 3],
     ...:                   [Timestamp('2018-01-15 21:57:00'), 323, 30, 9],
     ...:                   [Timestamp('2018-01-15 21:59:00'), 323, 1, 19]], dtype=object)
     ...:                   
In [495]: original
Out[495]: 
array([[numpy.datetime64('2018-01-15T01:59:00'), 329, 30, 5],
       [numpy.datetime64('2018-01-15T01:59:00'), 326, 25, 3],
       [numpy.datetime64('2018-01-15T02:00:00'), 324, 22, 34],
       [numpy.datetime64('2018-01-15T21:57:00'), 322, 23, 3],
       [numpy.datetime64('2018-01-15T21:57:00'), 323, 30, 9],
       [numpy.datetime64('2018-01-15T21:59:00'), 323, 1, 19]],
      dtype=object)

Now we can do the time test with:

In [500]: original[:,0]<Timestamp('2018-01-15 06:00:00')
Out[500]: array([ True,  True,  True, False, False, False])
In [501]: original[:,0]>Timestamp('2018-01-15 01:00:00')
Out[501]: array([ True,  True,  True,  True,  True,  True])
In [502]: mask = Out[500] & Out[501]
In [503]: mask
Out[503]: array([ True,  True,  True, False, False, False])
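The question states the third condition as a time-of-day window, while the comparisons above pin it to 2018-01-15. If the data spanned several days, one possible approach (a sketch, not part of the original answer) is to subtract each timestamp's date and compare the remaining time of day. The `stamps` values below are made up for illustration, including the entry on a second day:

```python
import numpy as np

# Illustrative timestamps (assumed data, not from the question).
stamps = np.array(
    ["2018-01-15T01:59:00", "2018-01-15T02:00:00",
     "2018-01-15T21:57:00", "2018-01-16T05:30:00"],
    dtype="datetime64[s]",
)

# Truncating to whole days and subtracting leaves the time of day
# as a timedelta64 in seconds.
time_of_day = stamps - stamps.astype("datetime64[D]")

# Keep rows strictly between 01:00 and 06:00 on any date.
in_window = (time_of_day > np.timedelta64(1, "h")) & (time_of_day < np.timedelta64(6, "h"))
print(in_window)
```

This needs a real datetime64 array, so an object column like `original[:,0]` would first have to be converted with `.astype('datetime64[s]')`.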

Test columns 2 and 3 (using the >= 30 threshold from the pandas code):

In [509]: (original[:,[2,3]]>=30).any(axis=1)
Out[509]: array([ True, False,  True, False,  True, False])

and

In [506]: (original[:,2]>(original[:,3]*2)) | (original[:,3]>=(original[:,2]*2))
     ...: 
Out[506]: array([ True,  True, False,  True,  True,  True])

and together

In [510]: mask & Out[509] & Out[506]
Out[510]: array([ True, False, False, False, False, False])
In [511]: np.where(Out[510])
Out[511]: (array([0]),)
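On the "only with np.where" part of the question: np.where with a single argument only turns a boolean mask into indices, and the mask can index the rows directly. A minimal sketch using just the three integer columns of the sample rows (the names `values`, `in_col` and `out_col` are made up here, and the time test is left out):

```python
import numpy as np

# Integer columns (freq, in, out) copied from the six sample rows.
values = np.array([
    [329, 30, 5],
    [326, 25, 3],
    [324, 22, 34],
    [322, 23, 3],
    [323, 30, 9],
    [323, 1, 19],
])
in_col, out_col = values[:, 1], values[:, 2]

# Threshold test AND ratio test, combined into one boolean mask.
mask = ((in_col >= 30) | (out_col >= 30)) & (
    (in_col > out_col * 2) | (out_col >= in_col * 2)
)

rows_by_mask = values[mask]   # boolean indexing selects the rows directly
idx = np.where(mask)[0]       # np.where just lists the True positions
rows_by_index = values[idx]   # same rows, selected by integer index
```

Both forms select the same rows; np.where is mainly useful when the row numbers themselves are needed.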

Sometimes object dtype hinders calculations, usually because a function can't delegate the task to methods of the objects. Here the Python integers can be compared, so the object arrays can also be compared. In a large array these comparisons might be faster if part of the array were first converted to a 2d numeric array.

In [512]: original[:,1:].astype(int)
Out[512]: 
array([[329,  30,   5],
       [326,  25,   3],
       [324,  22,  34],
       [322,  23,   3],
       [323,  30,   9],
       [323,   1,  19]])
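Following that idea, here is a sketch of the whole selection done on typed arrays instead of the object array. The helper `ts` and the names `times`, `nums` and `selected` are assumptions for illustration, and the thresholds follow the pandas code in the question. Whether this is actually faster would need timing on real data.

```python
import numpy as np

def ts(text):
    """Hypothetical helper: build a second-resolution datetime64."""
    return np.datetime64(text, "s")

original = np.array([
    [ts("2018-01-15 01:59:00"), 329, 30, 5],
    [ts("2018-01-15 01:59:00"), 326, 25, 3],
    [ts("2018-01-15 02:00:00"), 324, 22, 34],
    [ts("2018-01-15 21:57:00"), 322, 23, 3],
    [ts("2018-01-15 21:57:00"), 323, 30, 9],
    [ts("2018-01-15 21:59:00"), 323, 1, 19],
], dtype=object)

# Convert each part of the object array to a native dtype once.
times = original[:, 0].astype("datetime64[s]")
nums = original[:, 1:].astype(int)
in_col, out_col = nums[:, 1], nums[:, 2]

# All three conditions combined into a single boolean mask.
mask = (
    (times > ts("2018-01-15 01:00:00"))
    & (times < ts("2018-01-15 06:00:00"))
    & ((in_col >= 30) | (out_col >= 30))
    & ((in_col > out_col * 2) | (out_col >= in_col * 2))
)

selected = original[mask]  # rows of the original object array
print(selected)
```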

Pandas seems to be 'happier' dealing with object dtypes, but I think that flexibility comes at a speed cost.

Upvotes: 1
