john_science
john_science

Reputation: 6551

3D Array vs Sparse Matrix for Sparse Data

I am building a simple model of the Milky Way and one of the things I need to store is a 3D grid of mass densities.

The problem is that if I put a rectangular box around the galaxy, most of the grid cells are empty. Which leaves me saving a lot of useless zeros. So the naive array seems wasteful:

galaxy = [[[0 for k in xrange(1601)] for j in xrange(1601)] for i in xrange(253)]
# then fill in i,j,k values that are non-zero

I tried to build a sparse array using a dictionary:

for x in range(1601):
    for y in range(1601):
        for z in range (253):
            galaxy[str(x) + "," + str(y) + "," + str(z)] = # whatever

But, (aside from being ugly) the strings that I was using for keys were taking up more memory than I was saving. I got OutOfMemoryErrors because (I calculated) the keys alone were taking a couple of gigs of memory.

At some point, I will want to increase the resolution of my model, and that will mean a much larger grid. Is there a more efficient way to store my values than using a 3D-Array of floats?

I am also concerned about the time it takes to iterate through all the cells (or just the non-zero cells in my grid. This will be very important.

Upvotes: 3

Views: 1638

Answers (3)

Robᵩ
Robᵩ

Reputation: 168646

Quick math: 1601 * 1601 * 253 => 648489853 items. A test indicates that the dictionary takes about 24 bytes per entry on a 32-bit machine, 49 bytes on a 64-bit machine, so that's 15,563,756,472 bytes (or 30GB on 64-bit). 10% of that is 1.5GB (or 3.0GB on 64-bit). If you have a 64-bit system with a bunch of memory, I think you'll be okay with a sparse representation.

I recommend:

  1. Use a tuple as the key, not a string, and
  2. Use a sparse storage system where you don't store zero values.

Here is one possibility:

class SparseDict(dict):
  def __init__(self, default_value):
    dict.__init__(self)
    self._value = default_value
  def __getitem__(self, key):
    try:
      return dict.__getitem__(self, key)
    except KeyError:
      return self._value
  def __setitem__(self, key, val):
    # I'm sure this can go faster if I were smarter
    if val == self._value:
      if  key in self:
        del self[key]
    else:
      dict.__setitem__(self, key, val)

def test(galaxy):
  import sys
  print len(galaxy), sys.getsizeof(galaxy)

  # test is 1/10th size in each dimension,
  # so 1/1000th of the volume
  for x in range(160):
    for y in range(160):
      for z in range (25):
        import random
        # 90% of space is essentially a vacuum
        if random.random() < .1:
          galaxy[x,y,z] = 1502100
        else:
          galaxy[x,y,z] = 0

  print len(galaxy), sys.getsizeof(galaxy)

test(SparseDict(0))

Upvotes: 2

omer schleifer
omer schleifer

Reputation: 3935

Maybe try to save your data in sql tables and load only a subset of the cube according to your needs. this will cost you in time to load parts , but will save you memory. as for the in memory represntation, use the methods suggested in the other answers such as dictionary etc...

Upvotes: 0

cdjc
cdjc

Reputation: 1108

Try using the dictionary approach, but only store key:value pairs for keys whose value is non-zero. A better key might be a tuple of (x,y,z).

Upvotes: 2

Related Questions