Reputation: 4847
I have a 1-D NumPy array with over 150 million data points, filled using np.fromfile on a binary data file.
Given that array, I need to add a value 'val' to every point unless that point equals 'x'.
Further, each value in the array corresponds, depending on its value, to another value that I want to store in another list.
Explanation of variables:
** temps=np.arange(-30.00,0.01,0.01, dtype='float32')
** slr is a list; index 0 in temps corresponds to index 0 in slr, and so on. Both lists are the same length.
Here is my current code:
import sys
import numpy as np
with open("file.dat", "rb") as f:
    array = np.fromfile(f, dtype=np.float32)
f.close()
#This is the process below that I need to speed up
T_SLR = np.array(np.zeros(len(array), dtype='Float64'))
for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00
Upvotes: 2
Views: 817
Reputation: 1580
Since your selection criteria seem to be point-by-point, there's no reason you need to read in all 150 million points at once. You can use the count parameter of np.fromfile to limit the size of the arrays you compare at a time. Once you process in chunks larger than a few thousand points, the overhead of the for loop won't matter, and you won't strain your memory with huge arrays derived from all 150 million points.
slr and temps look like an indexed translation table. You could probably replace the search on temps with a floating-point range comparison and a computed lookup. Since -9.99e+8 is clearly outside the search criterion, you don't need any special treatment for those points.
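f = open("file.dat", "rb")
N = 10000
# size_of_TMPprs: presumably the file size in bytes (4 bytes per float32)
T_SLR = np.zeros(size_of_TMPprs // 4, dtype=np.float64)
slr = np.asarray(slr)  # a plain list cannot be indexed with an index array
t_off = 0
array = np.fromfile(f, count=N, dtype=np.float32)
while array.size > 0:
    array -= 273.15
    index = np.where((array >= -30) & (array <= 0))[0]
    # round to the nearest 0.01 step and cast to int so it is a valid index
    T_SLR[t_off+index] = slr[np.round((array[index]+30)*100).astype(int)]
    t_off += array.size
    array = np.fromfile(f, count=N, dtype=np.float32)
f.close()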
You can simplify this even more if you want T_SLR to contain the last entry in slr when the measured value is over zero. Then, you can use
array = np.maximum(np.minimum(array, 0), -30)
to limit the range of values in array, and just use it for the computed index into slr as above (no use of where in this case).
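The clamping idea above can be sketched as follows. This is a minimal illustration, not the answerer's code: the slr values are made up, the table length of 3001 is taken from the 0.01-degree grid between -30 and 0, and clipped_lookup is a hypothetical helper name. np.clip does the same job as the nested np.maximum/np.minimum call.

```python
import numpy as np

# Hypothetical stand-in for the question's table: one entry per 0.01-degree
# step from -30.00 to 0.00 (3001 entries). The values are invented.
n_steps = 3001
slr = np.linspace(1.0, 2.0, n_steps)

def clipped_lookup(celsius, table, lo=-30.0, hi=0.0, step=0.01):
    """Clamp temperatures to [lo, hi], then compute the table index directly."""
    clipped = np.clip(np.asarray(celsius, dtype=np.float64), lo, hi)
    idx = np.round((clipped - lo) / step).astype(np.int64)
    return np.asarray(table)[idx]

# The sentinel and anything below -30 land on the first entry,
# anything above 0 lands on the last entry.
sample = np.array([-9.99e8, -30.0, -15.0, 0.0, 12.5], dtype=np.float32)
result = clipped_lookup(sample, slr)
```

Note that with clamping the missing-data sentinel also receives the first table entry, so it would still need a separate mask if those points must stay at zero.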
Upvotes: 0
Reputation: 67427
Because temps is sorted, you can use np.searchsorted and avoid all explicit loops:
array[array != float(-9.99e+08)] -= 273.15
indices = np.searchsorted(temps, array)
# Remove indices out of bounds
mask = indices < temps.shape[0]
# Remove in-bounds indices not matching exactly
mask[mask] &= temps[indices[mask]] == array[mask]
# slr is a list, so convert it before indexing with an array
T_SLR = np.zeros(array.shape)
T_SLR[mask] = np.asarray(slr)[indices[mask]]
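To see what np.searchsorted returns and why both checks (in bounds, exact match) are needed, here is a small sketch on toy data. The four-element temps and slr arrays are invented for illustration, and the masking is written with np.where instead of in-place mask updates, which is an equivalent variation rather than the answer's exact code.

```python
import numpy as np

# Toy lookup table (invented values), sorted as np.searchsorted requires
temps = np.array([-3.0, -2.0, -1.0, 0.0])
slr = np.array([10.0, 20.0, 30.0, 40.0])
values = np.array([-2.0, -1.5, 0.0, 7.0])

# Insertion points that keep temps sorted: an exact match gives its own index,
# a value above the table gives temps.size (out of bounds)
indices = np.searchsorted(temps, values)

in_bounds = indices < temps.size
safe = np.where(in_bounds, indices, 0)        # avoid indexing past the end
match = in_bounds & (temps[safe] == values)   # keep exact matches only
out = np.where(match, slr[safe], 0.0)         # non-matching points stay 0
```

Here -1.5 falls between two table entries and 7.0 lies above the table, so both end up as 0 in out.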
Upvotes: 0
Reputation: 231395
When using with open, don't close the file yourself. The with context does that automatically. I'm also replacing the generic array name with something that has less risk of shadowing something else (like np.array?).
with open("file.dat", "rb") as f:
    data = np.fromfile(f, dtype=np.float32)
First, there's no need to wrap np.zeros in np.array. It already is an array. len(data) is ok if data is 1d, but I prefer to work with the shape tuple.
T_SLR = np.zeros(data.shape, dtype=np.float64)
Boolean indexing/masking lets you act on the whole array at once:
mask = data != -9.99e8 # don't need `float` here
# using != test with floats is poor idea
data[mask] -= 273.15
I need to refine the != test. It is ok for integers, but not for floats. Something like np.abs(data+9.99e8)>1 is better.
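As a rough sketch of that tolerance test, the snippet below masks out the sentinel and converts only the valid points. The MISSING name and the four sample values are assumptions made for illustration; the threshold of 1 is the one suggested above.

```python
import numpy as np

MISSING = -9.99e8  # sentinel value used in the question's data

# Small invented sample: two missing points and two temperatures in kelvin
data = np.array([MISSING, 273.15, 250.0, MISSING], dtype=np.float32)

# Treat anything within 1 unit of the sentinel as missing,
# instead of relying on an exact floating-point comparison
valid = np.abs(data - MISSING) > 1

# Convert only the valid points from kelvin to degrees Celsius
data[valid] -= 273.15
```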
Similarly, in is not a good test with floats. And with integers, the in and where perform redundant work.
Assuming temps is 1d, np.where(...) returns a 1-element tuple. [0] selects that element, returning an array. The , is then redundant in index,. index, = np.where() without the [0] should have worked.
T_SLR[i] is already 0 by how the array was initialized. No need to set it again.
for i in range(0,len(array)):
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00
But I think we can get rid of this iteration as well. But I'll leave that discussion for later.
In [461]: temps=np.arange(-30.00,0.01,0.01, dtype='float32')
In [462]: temps
Out[462]:
array([ -3.00000000e+01, -2.99899998e+01, -2.99799995e+01, ...,
-1.93138123e-02, -9.31358337e-03, 6.86645508e-04], dtype=float32)
In [463]: temps.shape
Out[463]: (3001,)
No wonder doing array[i] in temps and np.where(temps==array[i]) is slow.
We can cut out the in with a look at the where:
In [464]: np.where(temps==12.34)
Out[464]: (array([], dtype=int32),)
In [465]: np.where(temps==temps[3])
Out[465]: (array([3], dtype=int32),)
If there isn't a match, where returns an empty array.
In [466]: idx,=np.where(temps==temps[3])
In [467]: idx.shape
Out[467]: (1,)
In [468]: idx,=np.where(temps==123.34)
In [469]: idx.shape
Out[469]: (0,)
in can be faster than where if the match is early in the list, but it is as slow, if not more so, if the match is at the end or there is no match.
In [478]: timeit np.where(temps==temps[-1])[0].shape[0]>0
10000 loops, best of 3: 35.6 µs per loop
In [479]: timeit temps[-1] in temps
10000 loops, best of 3: 39.9 µs per loop
A rounding approach:
In [487]: (np.round(temps,2)/.01).astype(int)
Out[487]: array([-3000, -2999, -2998, ..., -2, -1, 0])
I'd suggest tweaking:
T_SLR = -(np.round(data, 2)/.01).astype(int)
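One way the rounding idea could replace the whole loop is sketched below. It is an assumption-laden illustration rather than the finished answer: slr is filled with invented values, the table is assumed to have 3001 entries (matching temps.shape above), and a value is mapped to its nearest 0.01 step instead of requiring an exact match.

```python
import numpy as np

n_steps = 3001                          # -30.00 ... 0.00 in 0.01 steps
slr = np.linspace(5.0, 20.0, n_steps)   # hypothetical table values

# Temperatures already in degrees Celsius, plus the missing-data sentinel
data = np.array([-30.0, -29.99, -0.01, 0.0, 5.0, -9.99e8], dtype=np.float32)

# Express each temperature as a whole number of hundredths of a degree,
# then shift so that -30.00 maps to index 0 and 0.00 to index 3000
hundredths = np.round(data.astype(np.float64) * 100).astype(np.int64)
idx = hundredths + 3000

# Only indices inside the table get a value; everything else stays 0
in_range = (idx >= 0) & (idx < n_steps)
T_SLR = np.zeros(data.shape, dtype=np.float64)
T_SLR[in_range] = slr[idx[in_range]]
```

Because the sentinel produces a hugely negative index, it fails the range check and needs no special handling.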
Upvotes: 0
Reputation: 2028
The slowest point in your code is the O(n) traversal of the list in:
if array[i] in temps:
    index, = np.where(temps==array[i])[0]
Since temps is not large, you can convert it to a dict:
temps2 = dict(zip(temps, range(len(temps))))
And make it O(1):
if array[i] in temps2:
    index = temps2[array[i]]
You can also try to avoid the for loop to speed things up. For example, the following code:
for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15
Can be done as:
array[array!=float(-9.99e+08)] -= 273.15
Another problem in your code is the float comparison. You should not use the exact equality operators == or !=. Try numpy.isclose with a tolerance, or convert the floats to ints by multiplying by 100.
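Both suggestions can be sketched on a single value. The snippet assumes the temps grid from the question and a made-up reading of 250.0 K; the tolerance of 1e-3 is an arbitrary choice that is smaller than the 0.01 grid spacing, and whether the exact comparison finds a match depends on float32 rounding, so it should not be relied on.

```python
import numpy as np

temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

# A hypothetical reading of 250.0 K converted to Celsius in float32 (about -23.15)
value = np.float32(250.0) - np.float32(273.15)

# Exact comparison: may or may not find a match because of rounding
exact = bool((temps == value).any())

# Tolerance-based comparison with numpy.isclose (absolute tolerance only)
close = np.isclose(temps, value, atol=1e-3, rtol=0)
nearest = int(np.flatnonzero(close)[0]) if close.any() else -1

# Integer alternative: count hundredths of a degree, then offset by 3000
as_int = int(np.round(float(value) * 100))
index_from_int = as_int + 3000
```

The integer route gives the table index with plain arithmetic and no search at all, which is why it scales well to 150 million points.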
Upvotes: 2