Reputation: 7962
If I have a huge list of lists in memory and I wish to convert it into an array, does the naive approach cause python to make a copy of all the data, taking twice the space in memory? Should I instead convert the list of lists vector by vector, popping each list off as I go?
# for instance
list_of_lists = [[...], ..., [...]]
arr = np.array(list_of_lists)
Edit:
Is it better to create an empty array of a known size and then populate it incrementally, thus avoiding the list_of_lists object entirely? Could this be accomplished by something as simple as some_array[i] = some_list_of_float_values?
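For example, something along these lines (the shape and the row values here are just placeholders for my real data):

import numpy as np

n_rows, n_cols = 1000, 50                                 # placeholder sizes
some_array = np.empty((n_rows, n_cols), dtype=np.float64)

for i in range(n_rows):
    some_list_of_float_values = [float(i)] * n_cols       # stand-in for one real row
    some_array[i] = some_list_of_float_values             # row copied straight into the preallocated block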
Upvotes: 0
Views: 1323
Reputation: 10650
I'm just putting this here as it's a bit long for a comment.
Have you read the numpy documentation for array?
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
"""
...
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements (dtype,
order, etc.).
...
"""
When you say you don't want to copy the data of the original array when creating the numpy
array, what data structure are you hoping to end up with?
A lot of the speed-up you get from using numpy
is because the C arrays it creates are contiguous in memory. A list in Python is just an array of pointers to objects, so you have to go and find each object every time - which isn't the case in numpy
, as it's not written in Python.
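A rough way to see that difference (the sizes here are arbitrary, just for illustration):

import sys
import numpy

rows, cols = 1000, 100
py_rows = [[float(c) for c in range(cols)] for r in range(rows)]
np_arr = numpy.array(py_rows)

# the numpy array holds the raw doubles in one contiguous buffer
print np_arr.flags['C_CONTIGUOUS'], np_arr.nbytes    # True, rows * cols * 8 bytes

# the python version is a list of pointers to row lists, each of which is a
# list of pointers to float objects - this counts only the pointer arrays,
# before even adding the per-float object overhead
print sys.getsizeof(py_rows) + sum(sys.getsizeof(r) for r in py_rows)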
If you just want the numpy
array to reference the Python lists in your 2D structure, then you'll lose the performance gains.
If you do np.array(my_2D_python_array, copy=False)
, I don't know what it will actually produce, but you could easily test it yourself - look at the shape of the array and see what kind of objects it houses.
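For example (a tiny made-up input, just to poke at what comes back):

import numpy

nested = [[1.0, 2.0], [3.0, 4.0]]
arr = numpy.array(nested, copy=False)

print arr.shape, arr.dtype    # (2, 2) float64 - the values were converted into a new buffer
nested[0][0] = 99.0
print arr[0, 0]               # still 1.0, so the array does not reference the original lists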
If you want the numpy array to be contiguous though, at some point you're going to have to allocate all of the memory it needs (and if it's as large as you're suggesting, it might be difficult to find a contiguous section large enough).
Sorry that was pretty rambling, just a comment. How big are the actual arrays you're looking at?
Here's a plot of the cpu usage and memory usage of a small sample program:
from __future__ import division
#Make a large python 2D array
N, M = 10000, 18750
print "%i x %i = %i doubles = %f GB" % (N, M, N * M, N*M*8/10**9)
#grab pid to monitor memory and cpu usage
import os
pid = os.getpid()
os.system("python moniter.py -p " + str(pid) + " &")
print "building python matrix"
large_2d_array = [[n + m*M for n in range(N)] for m in range(M)]
import numpy
from datetime import datetime
print datetime.now(), "creating numpy array with copy"
np1 = numpy.array(large_2d_array, copy=True)
print datetime.now(), "deleting array"
del(np1)
print datetime.now(), "creating numpy array without copy"
np1 = numpy.array(large_2d_array, copy=False)
print datetime.now(), "deleting array"
del(np1)
1, 2, and 3 are the points where each of the matrices finishes being created. Note that the native python list of lists takes up much more memory than the numpy arrays - python objects each have their own overhead, and the lists are lists of objects. For the numpy array this is not the case, so it is considerably smaller.
Also note that using copy=False on the python object has no effect - new data is always created. You could get around this by creating a numpy array of python objects (using dtype=object
), but I wouldn't advise it.
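For completeness, a sketch of what that could look like (again, not something I'd recommend):

import numpy

list_of_lists = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# a 1D array of references to the original list objects - the floats themselves
# are not copied, but you also give up numpy's fast element-wise operations
obj_arr = numpy.empty(len(list_of_lists), dtype=object)
for i, row in enumerate(list_of_lists):
    obj_arr[i] = row

print obj_arr[0] is list_of_lists[0]    # True - same list object, no copy of the data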
Upvotes: 2