Dmitrij Holkin

Reputation: 2055

Numpy array error setting an array element with a sequence

I tried to use nested lists to hold data scraped from HTML,

but after about 50,000 list appends I got a MemoryError,

so I decided to change the lists to a NumPy array.

SapList= []
ListAll  =  np.array([])

def eachshop(): #filling each list for each shop data
    global ListAll
    SapList.append(RowNum)
    SapList.extend([sap]) # can hold from one to 10 values in one list: ["sap1","sap2","sap3",...,"sap10"]
    SapList.extend([[strLink,ProdName],ProdCode,ProdH,NewPrice, OldPrice,[FileName+'#Komp!A1',KompPrice],[FileName+'#Sav!A1','Sav']])
    SapList.extend([ss]) # can hold from zero to 80 sublists of 3 values: [["id1", "link", "address"],["id80", "link", "address"]]


    ListAll = np.append(np.array(SapList))

So when I do print(ListAll) I get an exception at C:\Python36\scrap.py, line 307 ("ListAll = np.append(np.array(SapList))"): setting an array element with a sequence.

Now, to speed things up, I am using pool.map:

from multiprocessing.pool import ThreadPool

def makePool(cP, func, iters):
    try:
        pool = ThreadPool(cP)
        # iterate over the URLs
        pool.map_async(func, enumerate(iters, start=2)).get(99999)
        pool.close()
        pool.join()
    except:
        print('Pool Error')
        raise
    finally:
        pool.terminate()

So how can I use a NumPy array in my example to reduce memory usage and speed up I/O operations?

Upvotes: 2

Views: 339

Answers (2)

Dux

Reputation: 1246

As hpaulj pointed out already, numpy arrays will not help here, since you don't have consistent data sizes.

As Spinor8 suggested, dump out data in between instead:

AllList = []
limit = 10000
counter = 0
while not finished:
    if counter >= limit:
        print(AllList)   # or write the chunk to disk
        AllList = []
        counter = 0      # reset so the next chunk accumulates again
    item = CreateYourList(...)
    AllList.append(item)
    counter += 1

Edit: Since your question is specifically asking about numpy and you even opened a bounty: numpy is not going to help you here, and here is why:

  • For using numpy efficiently, you have to know the array size at the time of array creation. numpy.append() doesn't actually append anything in place, but creates a new array, which is a huge overhead with large arrays.
  • Numpy arrays work best if all items have the same number of elements. Specifically, you can think of a numpy array like a matrix: all rows have the same number of columns.
  • You could create a numpy array based on the largest element in your data stream, but this would mean you allocate memory that you don't need (array elements that will never be filled). This will clearly not solve your memory problem.

So IMHO, your only way to solve this is to break your stream into chunks that your memory can handle, and stitch it together afterwards. Maybe write it to a (temporary) file and append to it?
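A minimal sketch of that chunk-to-disk idea, assuming rows arrive from some generator; the names `scrape_rows`, `CHUNK`, and `scraped.csv` are placeholders for your own scraper and file:

```python
import csv

CHUNK = 10000  # flush to disk after this many rows (tune to your memory budget)

def scrape_rows():
    # stand-in for the real scraper: yields one row per record
    for i in range(25000):
        yield [i, "link%d" % i, "address%d" % i]

buffer = []
with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in scrape_rows():
        buffer.append(row)
        if len(buffer) >= CHUNK:
            writer.writerows(buffer)  # append the chunk to the file
            buffer = []               # free the memory
    if buffer:                        # flush the final partial chunk
        writer.writerows(buffer)
```

Only one chunk is ever held in memory; the file on disk ends up with all rows, and rows of different lengths are no problem for csv.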

Upvotes: 1

hpaulj

Reputation: 231395

It looks like you are trying to make an array from a list that contains a number and lists. Something like:

In [6]: np.array([1, [1,2],[3,4]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-812a9ccb6ca0> in <module>()
----> 1 np.array([1, [1,2],[3,4]])

ValueError: setting an array element with a sequence.

It does work if all elements are lists:

In [7]: np.array([[1], [1,2],[3,4,5]])
Out[7]: array([list([1]), list([1, 2]), list([3, 4, 5])], dtype=object)

But if they vary in length the result is an object array, not a 2d numeric array. Such an object dtype array is very much like a list of lists, containing pointers to lists elsewhere in memory.

A multidimensional numeric array can use less memory than a list of lists, but it isn't going to help if you need to make the lists first. And it does not help at all if the sublists vary in size.

Oh, and stay away from np.append. It's evil. Plus you misused it!
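To illustrate both points: np.append takes two arguments (the asker passed only one), and even the correct call copies the whole array every time, so a plain list plus one final np.array conversion is the usual pattern. A small sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, [4, 5])   # correct signature: np.append(arr, values)
print(b)                   # [1 2 3 4 5] -- a new array; `a` is unchanged

# np.append copies everything on each call, so building an array
# incrementally with it is O(n^2). Collect in a list and convert once:
rows = []
for i in range(5):
    rows.append(i)
out = np.array(rows)
print(out)                 # [0 1 2 3 4]
```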

Upvotes: 5
