bgame2498

Reputation: 4847

Python: Quickly loop through np.array

I have a 1-D NumPy array with over 150 million data points. It is filled using np.fromfile on a binary data file.

Given that array, I need to add a value 'val' to every point unless that point equals 'x'.

Further, each value in the array corresponds to another value (looked up from the array value itself) that I want to store in another list.

Explanation of variables:

** temps=np.arange(-30.00,0.01,0.01, dtype='float32')

** slr is a list; index 0 in temps corresponds to index 0 in slr, and so on. Both have the same length.

Here is my current code:

import sys
import numpy as np

with open("file.dat", "rb") as f:
    array = np.fromfile(f, dtype=np.float32)
f.close()

#This is the process below that I need to speed up 

T_SLR = np.array(np.zeros(len(array), dtype='Float64'))
for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15     
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00

Upvotes: 2

Views: 817

Answers (4)

Frank M

Reputation: 1580

Since your selection criteria seem to be point-by-point, there's no reason you need to read in all 150 million points at once. You can use the count parameter on np.fromfile to limit the size of the arrays you compare at a time. Once you process in chunks larger than a few thousand points, the overhead of the Python loop over chunks no longer matters, and you avoid straining memory with huge temporary arrays derived from all 150 million points.

slr and temps look like an indexed translation table. You could probably replace the search on temps with a floating-point range comparison and a computed lookup. Since -9.99e+8 is clearly outside the search range, you don't need any special treatment for those points.

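import os

N = 10000                                     # values read per chunk
slr_arr = np.asarray(slr)                     # slr is a list; an array allows array indexing
size_of_TMPprs = os.path.getsize("file.dat")  # assumed to be the file size in bytes
T_SLR = np.zeros(size_of_TMPprs // 4, dtype=np.float64)   # 4 bytes per float32
t_off = 0
with open("file.dat", "rb") as f:
    array = np.fromfile(f, count=N, dtype=np.float32)
    while array.size > 0:
        array -= 273.15
        # positions within this chunk that fall inside the table's range
        index = np.where((array >= -30) & (array <= 0))[0]
        # computed lookup: -30.00 -> slot 0, 0.00 -> slot 3000
        lookup = np.round((array[index] + 30) * 100).astype(int)
        T_SLR[t_off + index] = slr_arr[lookup]
        t_off += array.size
        array = np.fromfile(f, count=N, dtype=np.float32)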

You can simplify this even more if you want T_SLR to contain the last entry in slr when the measured value is over zero. Then, you can use

array = np.maximum(np.minimum(array, 0), -30)

to limit the range of values in array, and just use it for the computed index into slr as above (no use of where in this case).
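A minimal sketch of that clamped variant on toy data. The table slr_arr and the sample values are made up for illustration (the real slr comes from the question), and np.clip is used as an equivalent of the maximum/minimum pair:

```python
import numpy as np

# Hypothetical stand-in for the question's slr table: one entry per 0.01 step
# from -30.00 to 0.00, i.e. 3001 entries.
slr_arr = np.linspace(5.0, 25.0, 3001)

# Illustrative temperatures, already converted to Celsius.
array = np.array([-45.0, -30.0, -12.34, 0.0, 7.5], dtype=np.float32)

# Clamp to the table's range; same effect as np.maximum(np.minimum(array, 0), -30).
clamped = np.clip(array, -30, 0)

# Computed lookup: shift by 30, scale to hundredths, round to the nearest slot.
index = np.round((clamped.astype(np.float64) + 30) * 100).astype(int)
T_SLR = slr_arr[index]
```

Note that with clamping, values below -30 (including the -9.99e+8 missing-data marker) pick up the first entry of slr instead of 0, so this only fits if that behaviour is acceptable.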

Upvotes: 0

Jaime

Reputation: 67427

Because temps is sorted, you can use np.searchsorted and avoid all explicit loops:

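slr_arr = np.asarray(slr)   # slr is a list; an array allows array indexing
array[array != float(-9.99e+08)] -= 273.15
indices = np.searchsorted(temps, array)
# Keep only indices that are in bounds for temps
mask = indices < temps.shape[0]
# Of those, keep only the ones that match a temps entry exactly
mask[mask] &= temps[indices[mask]] == array[mask]
T_SLR = np.zeros(array.shape[0], dtype=np.float64)
T_SLR[mask] = slr_arr[indices[mask]]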
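To make the behaviour of np.searchsorted concrete, here is a small sketch on toy data. The table slr_arr is a hypothetical stand-in for slr, and the sample values were chosen to show an exact match, a value that falls between table entries, and a value past the end of the table:

```python
import numpy as np

temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')   # table from the question
slr_arr = np.arange(temps.size, dtype=np.float64)        # hypothetical stand-in for slr

# Two exact table entries, one value between entries, one above the table.
data = np.array([temps[0], temps[1234], -12.345, 50.0], dtype=np.float32)

# Insertion points; a value past the end of temps gets index len(temps).
indices = np.searchsorted(temps, data)
in_bounds = indices < temps.size

# An in-bounds index only counts if the table entry equals the value exactly.
exact = np.zeros(data.shape, dtype=bool)
exact[in_bounds] = temps[indices[in_bounds]] == data[in_bounds]

T_SLR = np.zeros(data.shape, dtype=np.float64)
T_SLR[exact] = slr_arr[indices[exact]]
```

Only values that are bit-for-bit equal to an entry of temps count as matches here, so real data would likely need rounding first for this lookup to hit.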

Upvotes: 0

hpaulj

Reputation: 231395

When using with open, don't close the file yourself. The with context does that automatically. I'm also replacing the generic array name with something that has less risk of shadowing something else (like np.array?).

with open("file.dat", "rb") as f:
    data = np.fromfile(f, dtype=np.float32)

First, there's no need to wrap np.zeros in np.array. It already is an array. len(data) is OK if data is 1d, but I prefer to work with the shape tuple.

T_SLR = np.zeros(data.shape, dtype='float64')

Boolean indexing/masking lets you act on the whole array at once:

mask = data != -9.99e8   # don't need `float` here
                         # using != test with floats is poor idea
data[mask] -= 273.15

I need to refine the != test. It is ok for integers, but not for floats. Something like np.abs(data+9.99e8)>1 is better
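A short sketch of that tolerance-based mask on made-up values (the three sample points are illustrative only; the marker value is the one from the question):

```python
import numpy as np

MISSING = -9.99e8   # missing-data marker from the question

# Illustrative data in Kelvin, with one missing point in the middle.
data = np.array([250.0, MISSING, 273.15], dtype=np.float32)

# Treat anything within 1 unit of the marker as missing, instead of testing != exactly.
valid = np.abs(data + 9.99e8) > 1

# Convert only the valid points to Celsius; the marker is left untouched.
data[valid] -= 273.15
```

The threshold of 1 is far larger than any float32 rounding error on the marker and far smaller than the gap between the marker and any real temperature, so the exact value of the threshold is not critical.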

Similarly in is not a good test with floats. And with integers, the in and where perform redundant work.

Assuming temps is 1d, the np.where(...) returns a 1 element tuple. [0] selects that element, returning an array. The , is then redundant in index,. index, = np.where() without the [0] should have worked.

T_SLR[i] is already 0 by how the array was initialized. No need to set it again.

for i in range(0,len(array)):
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00

But I think we can get rid of this iteration as well. But I'll leave that discussion for later.


In [461]: temps=np.arange(-30.00,0.01,0.01, dtype='float32')
In [462]: temps
Out[462]: 
array([ -3.00000000e+01,  -2.99899998e+01,  -2.99799995e+01, ...,
        -1.93138123e-02,  -9.31358337e-03,   6.86645508e-04], dtype=float32)
In [463]: temps.shape
Out[463]: (3001,)

No wonder doing array[i] in temps and np.where(temps==array[i]) is slow

We can cut out the in with a look at the where

In [464]: np.where(temps==12.34)
Out[464]: (array([], dtype=int32),)
In [465]: np.where(temps==temps[3])
Out[465]: (array([3], dtype=int32),)

If there isn't a match where returns an empty array.

In [466]: idx,=np.where(temps==temps[3])
In [467]: idx.shape
Out[467]: (1,)
In [468]: idx,=np.where(temps==123.34)
In [469]: idx.shape
Out[469]: (0,)

in can be faster than where if the match is early in the list, but it is as slow, if not slower, if the match is at the end or there is no match.

In [478]: timeit np.where(temps==temps[-1])[0].shape[0]>0
10000 loops, best of 3: 35.6 µs per loop
In [479]: timeit temps[-1] in temps
10000 loops, best of 3: 39.9 µs per loop

A rounding approach:

In [487]: (np.round(temps,2)/.01).astype(int)
Out[487]: array([-3000, -2999, -2998, ...,    -2,    -1,     0])

I'd suggest tweaking:

T_SLR = -(np.round(data, 2)/.01).astype(int)
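One way the rounding idea might be carried through to the final lookup, as a sketch only. The table slr_arr, the sample values, and the tolerance tol are all assumptions made for illustration; the on-grid check is an added guess at how to reproduce the question's "0 when the value is not in temps" rule:

```python
import numpy as np

slr_arr = np.linspace(5.0, 25.0, 3001)   # hypothetical stand-in for slr (3001 entries)

# Illustrative Celsius values: two on the 0.01 grid, one off the grid, one out of range.
data = np.array([-30.0, -12.34, -12.345, 3.0], dtype=np.float32)

# Integer hundredths of a degree, shifted so that -30.00 maps to slot 0.
scaled = data.astype(np.float64) * 100
hundredths = np.round(scaled)
idx = hundredths.astype(np.int64) + 3000

# Assumed tolerance (in hundredths) for "sits on the 0.01 grid" despite float32 noise.
tol = 0.05
on_grid = np.abs(scaled - hundredths) < tol
ok = on_grid & (idx >= 0) & (idx <= 3000)

# Everything that is off the grid or out of range keeps the initial 0.
T_SLR = np.zeros(data.shape, dtype=np.float64)
T_SLR[ok] = slr_arr[idx[ok]]
```

Working in integer hundredths avoids comparing floats for equality, and the whole lookup runs as a handful of array operations instead of a per-element search.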

Upvotes: 0

eph

Reputation: 2028

The slowest point in your code is the O(n) traversal of list in:

if array[i] in temps:
    index, = np.where(temps==array[i])[0]

Since temps is not large, you can convert it to dict:

temps2 = dict(zip(temps, range(len(temps))))

And make it O(1):

if array[i] in temps2:
    index = temps2[array[i]]

You can also try to avoid for loop to speed up. For example, the following code:

for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15

Can be done as:

array[array!=float(-9.99e+08)] -= 273.15

Another problem in your code is the float comparison. You should not use the exact equality operators == or != on floats; try numpy.isclose with a tolerance, or convert the floats to ints by multiplying by 100.
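A small illustration of why exact comparison is fragile, and of the two alternatives mentioned. The value -12.34 and the tolerance are arbitrary choices for the example:

```python
import numpy as np

a = np.float32(-12.34)   # value as it would come out of a float32 file
b = np.float64(-12.34)   # "the same" value in double precision

# Exact comparison fails: float32 and float64 round -12.34 differently.
exact = bool(a == b)

# Comparison with an explicit absolute tolerance succeeds.
close = bool(np.isclose(a, b, rtol=0.0, atol=1e-4))

# Integer alternative: compare whole hundredths of a degree.
same_hundredths = int(round(float(a) * 100)) == int(round(float(b) * 100))
```

Here exact comes out False while both close and same_hundredths come out True, which is the behaviour the lookup against temps actually needs.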

Upvotes: 2
