snowleopard

Reputation: 739

Memory Usage, Filling Pandas DataFrame using Dict vs using key and value Lists

I am making a package that reads a binary file and returns data that can be used to initialize a DataFrame. I am now wondering whether it is better to return a dict or two lists (one holding the keys and one holding the values).

The package is not meant to be entirely reliant on a DataFrame object, which is why it currently outputs the data as a dict (for easy access). But if there are memory and speed savings to be had (and these are paramount for my application, since I am dealing with millions of data points), I would rather output the key and value lists instead. These iterables would then be used to initialize a DataFrame.

Here is a simple example:

In [1]: d = {(1,1,1): '111',
   ...: (2,2,2): '222',
   ...: (3,3,3): '333',
   ...: (4,4,4): '444'}

In [2]: keyslist=[(1,1,1),(2,2,2),(3,3,3),(4,4,4)]

In [3]: valslist=['111','222','333','444']

In [4]: import pandas as pd

In [5]: dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

In [6]: dfdict
Out[6]: 
         0
a b c     
3 3 3  333
2 2 2  222
1 1 1  111
4 4 4  444

In [7]: dflist=pd.DataFrame(valslist,  index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))

In [8]: dflist
Out[8]: 
         0
a b c     
1 1 1  111
2 2 2  222
3 3 3  333
4 4 4  444

It is my understanding that d.values() and d.keys() create a new copy of the data. Disregarding the fact that a dict takes more memory than a list, does using d.values() and d.keys() lead to more memory usage than the list-pair implementation?
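
A quick check suggests any copy is shallow: only the references are duplicated, not the key tuples and value strings themselves. A minimal sketch (Python 3 spelling, where keys()/values() are views until materialized; under Python 2 they return new lists outright):

import sys

d = {(i, i, i): str(i) * 3 for i in range(100000)}

# Python 3: keys() is a lightweight view over the dict -- no copy yet.
view_size = sys.getsizeof(d.keys())        # a few dozen bytes

# Materializing the view copies only the references into a new list.
keys_copy = list(d.keys())
copy_size = sys.getsizeof(keys_copy)       # roughly 0.8 MB: 100k pointers

# The key tuples themselves are shared, not duplicated.
assert keys_copy[0] is next(iter(d.keys()))

print(view_size, copy_size)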

Upvotes: 3

Views: 7322

Answers (2)

snowleopard

Reputation: 739

Here are the benchmarks using memory_profiler:

Filename: testdict.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.2 MiB      0.0 MiB   @profile
     5                             def testdict():
     6
     7     66.2 MiB      0.0 MiB        d = {}
     8
     9    260.6 MiB    194.3 MiB        for i in xrange(0,1000000):
    10    260.6 MiB      0.0 MiB                d[(i,i,i)]=str(i)*3
    11
    12    400.2 MiB    139.6 MiB        dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

Filename: testlist.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.5 MiB      0.0 MiB   @profile
     5                             def testlist():
     6
     7     66.5 MiB      0.0 MiB        keyslist=[]
     8     66.5 MiB      0.0 MiB        valslist=[]
     9
    10    229.3 MiB    162.8 MiB        for i in xrange(0,1000000):
    11    229.3 MiB      0.0 MiB                keyslist.append((i,i,i))
    12    229.3 MiB      0.0 MiB                valslist.append(str(i)*3)
    13
    14    273.6 MiB     44.3 MiB        dflist=pd.DataFrame(valslist,  index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))
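
For reference, these tables come from memory_profiler's @profile decorator. A minimal harness for the dict case might look like this (Python 3 spelling; the list() wrappers are only needed there, since keys()/values() are views):

# testdict.py -- run with: python -m memory_profiler testdict.py
import pandas as pd
from memory_profiler import profile

@profile
def testdict():
    d = {}
    for i in range(1000000):  # xrange under Python 2
        d[(i, i, i)] = str(i) * 3
    dfdict = pd.DataFrame(list(d.values()),
                          index=pd.MultiIndex.from_tuples(list(d.keys()),
                                                          names=['a', 'b', 'c']))

if __name__ == '__main__':
    testdict()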

For the same task and value types, the dict implementation does seem less memory efficient, peaking at 400.2 MiB vs 273.6 MiB.

Edit

For some reason, when I change the values to lists of numbers (more representative of my data), I get very similar performance between the two. Does anyone know why this is happening?

Filename: testdict.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.9 MiB      0.0 MiB   @profile
     5                             def testdict():
     6
     7     66.9 MiB      0.0 MiB        d = {}
     8
     9    345.6 MiB    278.7 MiB        for i in xrange(0,1000000):
    10    345.6 MiB      0.0 MiB                d[(i,i,i)]=[0]*9
    11
    12    546.2 MiB    200.6 MiB        dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

Filename: testlist.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.3 MiB      0.0 MiB   @profile
     5                             def testlist():
     6
     7     66.3 MiB      0.0 MiB        keyslist=[]
     8     66.3 MiB      0.0 MiB        valslist=[]
     9
    10    314.7 MiB    248.4 MiB        for i in xrange(0,1000000):
    11    314.7 MiB      0.0 MiB                keyslist.append((i,i,i))
    12    314.7 MiB      0.0 MiB                valslist.append([0]*9)
    13
    14    515.2 MiB    200.6 MiB        dflist=pd.DataFrame(valslist,  index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))

Upvotes: 0

keiv.fly

Reputation: 4085

I profiled memory usage for 1M rows. The winning structure is an array.array for every numeric index column plus a plain list for the strings (147 MiB for the data, 310 MiB peak while converting to pandas), presumably because array.array stores the numbers as raw C longs rather than as boxed Python int objects.

According to the Python manual:

Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained.

They even have an append method, and appends are most likely very fast (see the incremental sketch after the benchmark below).

Second place goes to two separate lists (308 MiB for the data, 460 MiB peak).

The other two options, a dict and a list of 4-tuples, were the worst. Dict: 339 MiB and 524 MiB. List of 4-tuples: 308 MiB and 514 MiB.

Here is the use of array.array:

In [1]: from array import array
In [2]: import gc
In [3]: import pandas as pd
In [4]: %load_ext memory_profiler
In [5]: a1=array("l",range(1000000))
In [6]: a2=array("l",range(1000000))
In [7]: a3=array("l",range(1000000))
In [8]: b=[str(x*111) for x in list(range(1000000))]
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit a1,a2,a3,b
peak memory: 147.64 MiB, increment: 0.32 MiB
In [11]: %memit dfpair=pd.DataFrame(b,  index=pd.MultiIndex.from_arrays([a1,a2,a3], names=['a','b','c']))
peak memory: 310.60 MiB, increment: 162.91 MiB
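
Since the question involves decoding a binary file record by record, the arrays can also be filled incrementally. A rough sketch; frame_from_records and the (i, j, k, value) record shape are hypothetical stand-ins for whatever the parser yields:

from array import array
import pandas as pd

def frame_from_records(records):
    # records is assumed to yield (i, j, k, value) tuples.
    a1, a2, a3 = array('l'), array('l'), array('l')
    vals = []
    for i, j, k, v in records:
        a1.append(i)   # stored as a raw C long, not a boxed Python int
        a2.append(j)
        a3.append(k)
        vals.append(v)
    return pd.DataFrame(vals,
                        index=pd.MultiIndex.from_arrays([a1, a2, a3],
                                                        names=['a', 'b', 'c']))

# Example with synthetic records:
df = frame_from_records((i, i, i, str(i * 111)) for i in range(1000))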

Here is the rest of the code (very long):

List of tuples of four:

In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d2=[x+(b[i],) for i,x in enumerate(a)]
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d2
peak memory: 308.40 MiB, increment: 0.28 MiB
In [11]: %memit df = pd.DataFrame(d2, columns=['a','b','c','d']).set_index(['a','b','c'])
peak memory: 514.21 MiB, increment: 205.80 MiB

Dictionary:

In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d = dict(zip(a, b))
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d
peak memory: 339.14 MiB, increment: 0.23 MiB
In [11]: %memit dfdict=pd.DataFrame(list(d.values()),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
peak memory: 524.10 MiB, increment: 184.95 MiB

Two separate lists:

In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: gc.collect()
Out[6]: 0
In [7]: %memit a,b
peak memory: 307.75 MiB, increment: 0.19 MiB
In [8]: %memit dfpair=pd.DataFrame(b,  index=pd.MultiIndex.from_tuples(a, names=['a','b','c']))
peak memory: 459.94 MiB, increment: 152.19 MiB

Upvotes: 4
