Reputation: 900
I want to store my data in a nice way, so that I can later efficiently analyse/plot it using standard packages such as seaborn. Pandas seems to be the go-to library for storing structured data. However, I cannot find a satisfactory way of using it to store non-uniform data.
Let's say I have 3 arrays, representing the lengths of 100 rats, 1000 cats and 200 dogs:
import numpy as np

ratLength = np.random.uniform(1, 2, 100)
catLength = np.random.uniform(2, 3, 1000)
dogLength = np.random.uniform(3, 4, 200)
To the best of my understanding, I have 2 options of storing such data in a Pandas dataframe:

1. Pad the shorter arrays with np.nan so that they are all the same length and keep one column per animal.
2. Concatenate everything into a single column and add a label column that repeats the animal name for every value.

In both cases the storage structure forces me to significantly increase the memory footprint of my dataset. I presume one can construct even more pathological examples with more complicated datasets.
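Roughly, the two layouts I have in mind look like this (the variable, column and label names are only illustrative):

import pandas as pd

# option 1: wide layout, shorter columns padded with np.nan
wide = pd.DataFrame({
    'rat': pd.Series(ratLength),
    'cat': pd.Series(catLength),
    'dog': pd.Series(dogLength),
})  # 1000 rows; the rat and dog columns are mostly NaN

# option 2: long layout, the animal name repeated for every measurement
long = pd.DataFrame({
    'animal': ['rat'] * 100 + ['cat'] * 1000 + ['dog'] * 200,
    'length': np.concatenate([ratLength, catLength, dogLength]),
})  # 1300 label strings stored next to 1300 floats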
Question: Can you recommend a framework for Python to handle non-uniform datasets in an efficient way? It can be a smarter way of using Pandas or a different data storage library; I'm fine with either.
Upvotes: 0
Views: 147
Reputation: 30609
Further to my comment (too long to write as a comment):
import sys
import numpy as np
import pandas as pd

rat = np.random.uniform(1, 2, 100)
cat = np.random.uniform(2, 3, 1000)
dog = np.random.uniform(3, 4, 200)

# dict-like Series: one whole NumPy array stored per animal
s = pd.Series(dtype=object)
s['rat'] = rat
s['cat'] = cat
s['dog'] = dog

s.memory_usage(deep=True)
#10892
s.memory_usage(deep=True, index=False)
#10712
sum(map(sys.getsizeof, (cat, rat, dog)))
#10688
So the additional memory requirement is 180 B for the index and 24 B for the series overhead.
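Each element of s is still the original NumPy array, so per-animal summaries stay simple (a small sketch):

s.map(np.mean)   # mean length per animal
s.map(len)       # number of measurements per animal
s['cat'][:5]     # the raw cat array is available directly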
# long format: one row per measurement, the animal name repeated in the index
# (Series.append is gone in pandas 2.x, so pd.concat builds the long Series instead)
s1 = pd.concat([
    pd.Series(rat, index=['rat'] * len(rat)),
    pd.Series(cat, index=['cat'] * len(cat)),
    pd.Series(dog, index=['dog'] * len(dog)),
])
s1.memory_usage(deep=True)
#88400 --> 1300 strings! BAD

# a categorical index stores the 3 labels once plus small integer codes per row
s1.index = pd.CategoricalIndex(s1.index)
s1.memory_usage(deep=True)
#11960 --> 3 categories
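If the goal is plotting with seaborn, a long Series like s1 converts straight into the tidy frame seaborn expects; a minimal sketch, assuming seaborn is installed (the column names 'animal' and 'length' are arbitrary):

import seaborn as sns

df = s1.rename('length').rename_axis('animal').reset_index()
sns.boxplot(data=df, x='animal', y='length')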
Upvotes: 1