Aleksejs Fomins

Memory-efficient way of representing non-uniform datasets in python

I want to store my data in a nice way so that I can later efficiently analyse/plot it using standard packages such as seaborn. Pandas seems to be the go-to library for storing structured data. However, I cannot find a satisfactory way of using it to store non-uniform data.

Let's say I have three arrays representing the lengths of 100 rats, 1000 cats and 200 dogs:

import numpy as np

ratLength = np.random.uniform(1, 2, 100)
catLength = np.random.uniform(2, 3, 1000)
dogLength = np.random.uniform(3, 4, 200)

To the best of my understanding, I have two options for storing such data in a Pandas DataFrame:

  1. One row per animal: the first column holds the animal kind (rat/cat/dog), the second the length.
  2. One row per rat-cat-dog triplet: here I would have to pad the shorter arrays with np.nan so that all three columns have the same length. (Both layouts are sketched below.)
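
For concreteness, a minimal sketch of both layouts and their memory footprint (the DataFrame and variable names below are my own):

import pandas as pd

# Option 1: long form, one row per animal; the kind string is repeated per row
longDf = pd.DataFrame({
    'kind': ['rat'] * 100 + ['cat'] * 1000 + ['dog'] * 200,
    'length': np.concatenate([ratLength, catLength, dogLength]),
})

# Option 2: wide form, one column per kind; a dict of Series aligns on the
# index, so the shorter columns are automatically padded with NaN
wideDf = pd.DataFrame({
    'rat': pd.Series(ratLength),
    'cat': pd.Series(catLength),
    'dog': pd.Series(dogLength),
})

longDf.memory_usage(deep=True).sum()  # dominated by the repeated kind strings
wideDf.memory_usage(deep=True).sum()  # 3 x 1000 floats, mostly NaN padding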

In both cases the storage structure forces me to significantly increase the memory footprint of my dataset: option 1 repeats the kind string for every row, while option 2 wastes space on NaN padding. I presume one can construct even more pathological examples with more complicated datasets.

Question: Can you recommend a framework for Python that handles non-uniform datasets in a memory-efficient way? It can be a smarter way of using Pandas or a different data-storage library; I'm fine with either.

Answers (1)

Stef

Further to my comment (too long to write as a comment):

import sys
import numpy as np
import pandas as pd

rat = np.random.uniform(1, 2, 100)
cat = np.random.uniform(2, 3, 1000)
dog = np.random.uniform(3, 4, 200)

# object-dtype Series: each element holds one whole numpy array
s = pd.Series(dtype=object)
s['rat'] = rat
s['cat'] = cat
s['dog'] = dog

s.memory_usage(deep=True)
#10892
s.memory_usage(deep=True, index=False)
#10712
sum(map(sys.getsizeof, (rat, cat, dog)))
#10688

So the additional memory requirement is 180 B for the index and 24 B for the Series overhead (the exact numbers vary with the pandas and NumPy versions).
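
Since each element of this object-dtype Series is an ordinary numpy array, you can, for example, compute per-kind statistics directly by mapping a function over the elements:

s.map(np.mean)  # mean length per animal kind
s.map(len)      # number of animals per kind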


You can also store your data in long form (I think this is what you meant in your comment). In this case you should use a categorical index, so that only the 3 category strings are stored instead of 1300:

# long form: one row per animal, with the kind as index label
# (built with pd.concat; Series.append was removed in pandas 2.0)
s1 = pd.concat([
    pd.Series(rat, index=['rat'] * len(rat)),
    pd.Series(cat, index=['cat'] * len(cat)),
    pd.Series(dog, index=['dog'] * len(dog)),
])
s1.memory_usage(deep=True)
#88400 --> 1300 strings! BAD
s1.index = pd.CategoricalIndex(s1.index)
s1.memory_usage(deep=True)
#11960 --> 3 categories
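
A long-form Series like this also plugs straight into seaborn; one possible way to get the two-column DataFrame shape seaborn expects:

import seaborn as sns

# DataFrame with columns 'kind' (categorical) and 'length'
df = s1.rename('length').rename_axis('kind').reset_index()
sns.histplot(data=df, x='length', hue='kind')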
