Reputation: 238

Numpy and memory with big arrays

I have to work with big arrays, for example x = np.arange(0, 750*350*365, dtype=np.int32).

I know Python keeps an object in memory as long as there is at least one reference to it.

But let's say I have to import a big array, do some math on it, and save a smaller array computed from the big one. Would the big array still be in memory?

For example :

import numpy as np

class Data:
    value = None

def process(myDataInstance):
    x = np.arange(0, 750*350*365, dtype=np.int32)
    ix = np.where(x < 50000)
    myDataInstance.value = x[ix]

d = Data()
process(d)

(In real life, I'm not creating the array in the function but loading a file that contains large arrays; this is just for the sake of the example.)

Will x still be in memory even after we have left the process function? Edit: I know x will not be reachable: if I type print x outside the function, there will be an error because it was defined in the scope of the function. I'm asking about memory and references, not variable names.

If yes, should I use myDataInstance.value = x[ix].copy() to create another array, so that the reference is dropped when leaving the function?

If no, where does the copy happen?

Thanks for the explanation

Upvotes: 2

Answers (3)

lightalchemist

Reputation: 10219

Fancy indexing, unlike slicing, does not return a view, so you will not end up holding a reference to your big array. See the official explanation of views vs. copies in NumPy.

To directly answer your question, the part where you write myDataInstance.value = x[ix] is where the copying is done. You do not need to explicitly call copy unless you are doing slicing.

To delve deeper, one way to check whether a variable is a view of a NumPy array is to use NumPy's shares_memory function:

import numpy as np
X = np.arange(10)
x = X[np.where(X > 5)]
np.shares_memory(X, x)  # This outputs False

x = X[np.where(X >= 0)]
np.shares_memory(X, x)  # Still false

You can also use sys.getrefcount(var) to check the number of references pointing to a variable var at one time.

import sys
X = np.arange(10)
print(sys.getrefcount(X)) # This prints 2
x = X[np.where(X > 0)]
print(sys.getrefcount(X)) # This still prints 2

Note that sys.getrefcount(X) prints 2 because one reference is held by the variable X and the other is the temporary reference created by passing X as an argument to sys.getrefcount(). Neither of them is held by x.

So in conclusion, you do not need to do an explicit copy if you are doing fancy indexing like in your example. If you are doing slicing, then that is a different story.
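To make the slicing case concrete, here is a minimal sketch, assuming CPython's reference counting. The helper names take_slice and take_fancy are made up for illustration. A weak reference is used to observe whether the big array is still alive after each function returns.

```python
import weakref

import numpy as np


def take_slice(n):
    """Return a basic slice (a view) plus a weak reference to the big array."""
    big = np.arange(n, dtype=np.int32)
    return big[:5], weakref.ref(big)


def take_fancy(n):
    """Return a fancy-indexed result (a copy) plus a weak reference to the big array."""
    big = np.arange(n, dtype=np.int32)
    return big[np.where(big < 5)], weakref.ref(big)


view, big_from_slice = take_slice(1_000_000)
picked, big_from_fancy = take_fancy(1_000_000)

# The view keeps the whole big array alive through its .base attribute.
slice_keeps_big_alive = big_from_slice() is not None and view.base is big_from_slice()

# The fancy-indexed result does not reference the big array,
# so the big array was freed when the function returned.
fancy_released_big = big_from_fancy() is None

print(slice_keeps_big_alive, fancy_released_big)
```

With a slice, storing big[:5].copy() instead of big[:5] should give the same release behaviour as the fancy-indexed case.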

Upvotes: 2

Daneel R.

Reputation: 547

To remove a Python object an_object from memory, call del an_object and wait for garbage collection to kick in. Garbage collection can also be triggered manually with the gc module, at your own risk.

It is important to clarify that del an_object or similar deletion methods do not remove the object from memory; they only remove the name an_object from the namespace. The memory is released only once no other reference to the object remains. In CPython this normally happens as soon as the last reference disappears, and the gc module's collector only has to step in for reference cycles.
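A minimal sketch of that distinction, assuming CPython (where an object is freed as soon as its last reference goes away). The names big, alias and probe are illustrative only:

```python
import weakref

import numpy as np

big = np.arange(1_000_000, dtype=np.int32)
alias = big                 # a second name bound to the same array
probe = weakref.ref(big)    # lets us observe the array without keeping it alive

del big                     # removes only the name "big"
alive_after_first_del = probe() is not None   # alias still references the array

del alias                   # the last strong reference is gone
freed_after_second_del = probe() is None      # the array has been released

print(alive_after_first_del, freed_after_second_del)
```

The first del leaves the array in memory because alias still points at it; only the second del lets the memory go.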

UPDATE To answer the comment here below, we can check whether a slice of an array is a reference to the original array or not with the following code:

import numpy as np

x_old = np.arange(0, 10, 1)  # x_old = np.array([0,1,2,3,4,5,6,7,8,9])

x_new_1 = x_old[:5]  # We slice the array, without calling .copy()
# x_new_1 = np.array([0,1,2,3,4])

x_old[2] = 100  # We change the third element of the original array, from 2 to 100
print(x_new_1)  # The output is [  0   1 100   3   4]. x_new_1 is thus a view of x_old,
# not an independent copy

x_old[2] = 2  # Restore original value
x_new_2 = x_old[:5].copy()  # This time we call .copy() on the slice, or the whole array for that matter.
x_old[2] = 100  # again we change the value

print(x_new_2)  # The output is [0 1 2 3 4]

Therefore, calling .copy() on the original array (or on a slice of it) creates a new object, allowing you to delete the old one from the namespace and wait for its automatic removal from memory. If you do not call .copy(), you are still working with a view of the old object and, as a consequence, whatever happens to the original object affects the view.

What should you do if you want to remove from memory part of an array:

1) Copy the slice of the original array that you want to keep into a new array with a new name.

2) Call del or any other deletion instruction on the original array

3) Wait for its automatic deletion from memory

4) Continue working with the new object.

Since you are working with big arrays though, remember that you have both arrays loaded in memory for a certain amount of time if you use this process.
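The four steps above can be sketched as follows. This is only an illustration with a smaller array; the variable names are assumptions, and the prompt release after del relies on CPython's reference counting.

```python
import numpy as np

big = np.arange(1_000_000, dtype=np.int32)
big_nbytes = big.nbytes          # one million int32 values, 4 bytes each

# 1) Copy the part to keep into a new, independent array.
small = big[:5].copy()

# 2) and 3) Drop the only name for the big array so it can be released.
del big

# 4) Keep working with the small array, which owns its own data buffer.
small_owns_data = small.flags.owndata
print(big_nbytes, small.nbytes, small_owns_data)
```

Between steps 1 and 2 both arrays exist at once, so peak memory is the big array plus the copy.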

UPDATE 2

OP, as mentioned by @lightalchemist in the comment down below, the code you provided does not produce a reference to x, but rather a copy of part of it. The code you provided as an example does not fit the description of the problem you are facing.

Upvotes: 0

tda

Reputation: 2133

The variables specified in the scope of your process() will be removed from memory once you are no longer executing that function. You can see this in action by running the following:

import numpy as np

class Data:
    value = None

def process(myDataInstance):
    x = np.arange(0, 750*350*365, dtype=np.int32)
    ix = np.where(x < 50000)
    myDataInstance.value = x[ix]

d = Data()
process(d)
print(ix)  # ix is not defined at module level

>>> Traceback (most recent call last):
  File "/workspace/PRISE/src/datacube/prod_mngr/data_fusion.py", line 26, in <module>
    print(ix)
NameError: name 'ix' is not defined

You get a NameError as the variable ix is only defined in the scope of the process function.

NOTE: If you had myDataInstance.ix = np.where(x < 50000) in process(), then after the process(d) line you'd be able to access the index using print(d.ix), because that attaches it to the Data instance d, which you hold a reference to globally.
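A minimal sketch of that note, assuming the index is stored on the instance passed into the function (written here as my_data_instance.ix, since a plain function has no self) and read back through d. The array is kept small for illustration:

```python
import numpy as np


class Data:
    value = None


def process(my_data_instance):
    x = np.arange(0, 1000, dtype=np.int32)
    # Attaching the index to the instance keeps it alive after the function returns.
    my_data_instance.ix = np.where(x < 50)
    my_data_instance.value = x[my_data_instance.ix]


d = Data()
process(d)
print(d.ix[0].size, d.value.size)
```

The attribute lives on the instance d, not on the Data class, so Data.ix would still raise an AttributeError.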

EDIT for further clarification:

Once a variable goes out of scope and nothing else references the object it pointed to, that object is removed from memory automatically in Python. See Garbage Collection in Python for more info.

Upvotes: 0
