Reputation: 311
I need to create about 2 million vectors with 1,000 slots each (each slot just holds an integer).
What would be the best data structure for working with this amount of data? It could be that I'm over-estimating the amount of processing/memory involved.
I need to iterate over a collection of files (about 34.5 GB in total) and update the vectors each time one of the 2 million items (each corresponding to a vector) is encountered on a line.
I could easily write naive code for this (something like the sketch below), but I know it wouldn't be efficient enough to handle this volume of data, which is why I'm asking you experts. :)
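Rough sketch of the naive approach I have in mind (the file format, the item-to-row lookup, and which slot gets bumped are all placeholders):

import numpy as np

vectors = np.zeros((2_000_000, 1_000), dtype=np.int64)   # about 16 GB as int64
item_to_row = {}                                         # placeholder item -> row mapping

def process_line(line):
    item, slot = line.split()[:2]                        # made-up line format
    row = item_to_row.setdefault(item, len(item_to_row))
    vectors[row, int(slot)] += 1

for path in ["file1.txt", "file2.txt"]:                  # placeholder file list
    with open(path) as fh:
        for line in fh:
            process_line(line)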
Best, Georgina
Upvotes: 3
Views: 2180
Reputation: 68702
You might be memory-bound on your machine. Without cleaning up running programs, an array like
import numpy
a = numpy.zeros((1000000, 1000), dtype=int)  # roughly 8 GB if int is 64-bit
wouldn't fit into memory. But in general, if you can break the problem up so that you don't need the entire array in memory at once, or you can use a sparse representation, I would go with numpy (scipy for the sparse representation).
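For example, here is a rough sketch of the memory arithmetic plus a sparse layout that supports incremental updates (the dok_matrix choice and the specific indices are assumptions on my part, not something from the question):

import numpy as np
from scipy import sparse

n_items, n_slots = 2_000_000, 1_000

# Dense int64 array: 2e6 * 1e3 * 8 bytes, about 16 GB.
print(n_items * n_slots * np.dtype(np.int64).itemsize / 1e9, "GB")

# dok_matrix allows cheap element-wise updates while scanning the files;
# convert to csr_matrix afterwards for fast row slicing and arithmetic.
counts = sparse.dok_matrix((n_items, n_slots), dtype=np.int64)
counts[12345, 7] += 1      # hypothetical update: item 12345, slot 7
counts = counts.tocsr()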
Also, you could think about storing the data on disk in hdf5 (with h5py or pytables) or netcdf4 (with netcdf4-python) and then accessing only the portions you need.
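A minimal h5py sketch of that idea (the file name, dataset name, and chunk shape are arbitrary choices here):

import h5py

with h5py.File("vectors.h5", "w") as f:
    dset = f.create_dataset("vectors", shape=(2_000_000, 1_000),
                            dtype="int64", chunks=(1_000, 1_000))
    # Hypothetical update: bump one slot of one item's row without
    # pulling the whole array into memory.
    row = dset[42, :]
    row[7] += 1
    dset[42, :] = row

with h5py.File("vectors.h5", "r") as f:
    part = f["vectors"][:10_000, :]   # read back only the rows you need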
Upvotes: 5
Reputation: 37919
If you need to work in RAM, try the scipy.sparse matrix variants. The module includes algorithms for manipulating sparse matrices efficiently.
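As a sketch of the usual workflow (the triplet values below are placeholders, not the poster's data): collect (row, column, count) entries while scanning the files, then build a CSR matrix, which sums duplicate entries on conversion.

import numpy as np
from scipy import sparse

rows = np.array([0, 0, 5])     # item indices seen in the files
cols = np.array([3, 3, 9])     # slot indices for those items
vals = np.array([1, 1, 1])     # one count per occurrence

m = sparse.coo_matrix((vals, (rows, cols)), shape=(2_000_000, 1_000),
                      dtype=np.int64).tocsr()
print(m[0, 3])                 # prints 2 (duplicate entries were summed)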
Upvotes: 1