Reputation: 2055
I tried to use nested lists to hold data scraped from HTML,
but after about 50,000 list appends I got a MemoryError.
So I decided to switch from lists to a NumPy array:
SapList = []
ListAll = np.array([])

def eachshop():  # filling each list for each shop data
    global ListAll
    SapList.append(RowNum)
    SapList.extend([sap])  # here can be from one to 10 values in one list["sap1","sap2","sap3",...,"sap10"]
    SapList.extend([[strLink, ProdName], ProdCode, ProdH, NewPrice, OldPrice, [FileName+'#Komp!A1', KompPrice], [FileName+'#Sav!A1', 'Sav']])
    SapList.extend([ss])  # here can be from null to 80 sublist with 3 values [["id1", "link", "address"],["id80", "link", "address"]]
    ListAll = np.append(np.array(SapList))
Then, when I call print(ListAll),
I get this exception (C:\Python36\scrap.py, line 307, "ListAll = np.append(np.array(SapList))"): setting an array element with a sequence
To speed things up I am now using pool.map:
def makePool(cP, func, iters):
    try:
        pool = ThreadPool(cP)
        # iterate over the URLs
        pool.map_async(func, enumerate(iters, start=2)).get(99999)
        pool.close()
        pool.join()
    except:
        print('Pool Error')
        raise
    finally:
        pool.terminate()
So how can I use a NumPy array in my example to reduce memory usage and speed up I/O operations?
Upvotes: 2
Views: 339
Reputation: 1246
As hpaulj already pointed out, numpy arrays will not help here, since your data does not have a consistent size.
As Spinor8 suggested, dump the data out in batches instead:
AllList = []
limit = 10000
counter = 0
while not finished:
    if counter >= limit:
        print(AllList)  # dump the current batch
        AllList = []
        counter = 0     # start counting the next batch
    item = CreateYourList(...)
    AllList.append(item)
    counter += 1
Edit: Since your question specifically asks about numpy and you even opened a bounty: numpy is not going to help you here, and here is why: np.append() doesn't actually append anything in place. It creates a new array and copies the existing data into it, which is a huge overhead with large arrays. So IMHO, your only way to solve this is to break your stream into chunks that your memory can handle, and stitch them together afterwards. Maybe write each chunk to a (temporary) file and append to it?
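A minimal sketch of the "write chunks to a temporary file" idea, assuming the rows are plain nested lists of strings and numbers so that each one fits on a JSON line. The helper names (flush_chunk, iter_rows, fake_rows) and the batch size are made up for illustration; fake_rows only stands in for the real scraper.

```python
import json
import os
import tempfile


def flush_chunk(rows, path):
    """Append every buffered row to the file as one JSON line, then empty the buffer."""
    with open(path, "a", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    rows.clear()


def iter_rows(path):
    """Read the rows back one at a time, without loading the whole file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


def fake_rows(n):
    """Hypothetical stand-in for the scraper: ragged, nested rows."""
    for i in range(n):
        saps = ["sap%d" % k for k in range(i % 3 + 1)]
        shops = [["id", "link", "address"]] * (i % 2)
        yield [i, saps, shops]


limit = 1000  # rows kept in memory before writing them out
buffer = []
fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)

for row in fake_rows(2500):
    buffer.append(row)
    if len(buffer) >= limit:
        flush_chunk(buffer, path)
if buffer:  # write the last, partial batch
    flush_chunk(buffer, path)

total = sum(1 for _ in iter_rows(path))
first_row = next(iter_rows(path))
os.remove(path)
```

Memory use is then bounded by limit rows rather than by the total number of rows scraped.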
Upvotes: 1
Reputation: 231395
It looks like you are trying to make an array from a list that contains a number and lists. Something like:
In [6]: np.array([1, [1,2],[3,4]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-812a9ccb6ca0> in <module>()
----> 1 np.array([1, [1,2],[3,4]])
ValueError: setting an array element with a sequence.
It does work if all elements are lists:
In [7]: np.array([[1], [1,2],[3,4,5]])
Out[7]: array([list([1]), list([1, 2]), list([3, 4, 5])], dtype=object)
But if they vary in length the result is an object array, not a 2d numeric array. Such an object dtype array is very much like a list of lists, containing pointers to lists elsewhere in memory.
A multidimensional numeric array can use less memory than a list of lists, but it isn't going to help if you need to make the lists first. And it does not help at all if the sublists vary in size.
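To put a rough number on the memory claim, here is a small sketch comparing a uniform list of lists with the equivalent int64 array. The size n and the variable names are arbitrary, and sys.getsizeof only gives an approximate lower bound for the nested list on CPython.

```python
import sys

import numpy as np

n = 1000
nested = [[i, i + 1, i + 2] for i in range(n)]  # uniform rows of 3 ints
arr = np.array(nested, dtype=np.int64)

# Raw data buffer of the array: n rows * 3 columns * 8 bytes.
array_bytes = arr.nbytes

# Outer list plus every inner list; the int objects they point to
# are not even counted, so the real figure is higher.
list_bytes = sys.getsizeof(nested) + sum(sys.getsizeof(r) for r in nested)
```

Note that nested still had to be built first, so the saving only applies once the list is discarded, and it disappears entirely when the rows are ragged.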
Oh, and stay away from np.append. It's evil. Plus you misused it! It takes two arguments, the array and the values to append, and it returns a new array.
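A short illustration of why np.append is costly, using toy values: it returns a fresh copy each time and leaves the original array alone. The second half shows the usual alternative when rows are uniform, which is to grow a plain list and convert once at the end.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)  # two arguments: the array and the values to append

# a is untouched; b is a newly allocated array holding a copy plus the new value.
a_unchanged = a.tolist()
b_list = b.tolist()
shares_memory = np.shares_memory(a, b)

# Cheaper pattern for uniform rows: append to a list, build the array once.
rows = []
for i in range(5):
    rows.append([i, i * 2])
arr = np.array(rows)
```

Calling np.append inside a loop copies the whole array on every iteration, so the total work grows quadratically with the number of rows.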
Upvotes: 5