Reputation: 739
I am making a package that reads a binary file and returns data that can be used to initialize a DataFrame. I am now wondering whether it is best to return a dict or two lists (one holding the keys and one holding the values).
The package is not supposed to rely entirely on a DataFrame object, which is why it currently outputs the data as a dict (for easy access). If there are memory and speed savings to be had (which is paramount for my application, as I am dealing with millions of data points), I would like to output the key and value lists instead. These iterables would then be used to initialize a DataFrame.
Here is a simple example:
In [1]: d = {(1,1,1): '111',
...: (2,2,2): '222',
...: (3,3,3): '333',
...: (4,4,4): '444'}
In [2]: keyslist=[(1,1,1),(2,2,2),(3,3,3),(4,4,4)]
In [3]: valslist=['111','222','333','444']
In [4]: import pandas as pd
In [5]: dfdict=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
In [6]: dfdict
Out[6]:
0
a b c
3 3 3 333
2 2 2 222
1 1 1 111
4 4 4 444
In [7]: dflist=pd.DataFrame(valslist, index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))
In [8]: dflist
Out[8]:
0
a b c
1 1 1 111
2 2 2 222
3 3 3 333
4 4 4 444
It is my understanding that d.values() and d.keys() each create a new copy of the data. If we disregard the fact that a dict takes more memory than a list, does using d.values() and d.keys() lead to more memory usage than the two-list implementation?
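For reference, here is a rough way to sanity-check how much the containers themselves cost (only a sketch: sys.getsizeof reports just the container of references, not the tuples and strings it points to, which are shared by both layouts):
import sys

d = {(i, i, i): str(i) * 3 for i in range(1000)}
keyslist = list(d.keys())    # explicit copy of the key references
valslist = list(d.values())  # explicit copy of the value references

# sizes of the containers only; the tuples and strings are the same objects either way
print(sys.getsizeof(d))
print(sys.getsizeof(keyslist) + sys.getsizeof(valslist))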
Upvotes: 3
Views: 7322
Reputation: 739
Here are the benchmarks using memory_profiler:
Filename: testdict.py
Line # Mem usage Increment Line Contents
================================================
4 66.2 MiB 0.0 MiB @profile
5 def testdict():
6
7 66.2 MiB 0.0 MiB d = {}
8
9 260.6 MiB 194.3 MiB for i in xrange(0,1000000):
10 260.6 MiB 0.0 MiB d[(i,i,i)]=str(i)*3
11
12 400.2 MiB 139.6 MiB dfdict=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
Filename: testlist.py
Line # Mem usage Increment Line Contents
================================================
4 66.5 MiB 0.0 MiB @profile
5 def testlist():
6
7 66.5 MiB 0.0 MiB keyslist=[]
8 66.5 MiB 0.0 MiB valslist=[]
9
10 229.3 MiB 162.8 MiB for i in xrange(0,1000000):
11 229.3 MiB 0.0 MiB keyslist.append((i,i,i))
12 229.3 MiB 0.0 MiB valslist.append(str(i)*3)
13
14 273.6 MiB 44.3 MiB dflist=pd.DataFrame(valslist, index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))
For the same task and the same data, it does seem that the dictionary implementation is not as memory efficient.
For some reason, when I change the values to lists of numbers (more representative of my data) I get very similar performance. Does anyone know why this is happening?
Filename: testdict.py
Line # Mem usage Increment Line Contents
================================================
4 66.9 MiB 0.0 MiB @profile
5 def testdict():
6
7 66.9 MiB 0.0 MiB d = {}
8
9 345.6 MiB 278.7 MiB for i in xrange(0,1000000):
10 345.6 MiB 0.0 MiB d[(i,i,i)]=[0]*9
11
12 546.2 MiB 200.6 MiB dfdict=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
Filename: testlist.py
Line # Mem usage Increment Line Contents
================================================
4 66.3 MiB 0.0 MiB @profile
5 def testlist():
6
7 66.3 MiB 0.0 MiB keyslist=[]
8 66.3 MiB 0.0 MiB valslist=[]
9
10 314.7 MiB 248.4 MiB for i in xrange(0,1000000):
11 314.7 MiB 0.0 MiB keyslist.append((i,i,i))
12 314.7 MiB 0.0 MiB valslist.append([0]*9)
13
14 515.2 MiB 200.6 MiB dflist=pd.DataFrame(valslist, index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))
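For reproducibility, the profiled scripts look roughly like the reconstruction below (taken from the line contents in the output above; Python 2, hence xrange). Running python -m memory_profiler testdict.py makes the bare @profile decorator available without an import.
# testdict.py -- reconstructed from the profiler output above
import pandas as pd

@profile  # provided by memory_profiler when run via: python -m memory_profiler testdict.py
def testdict():

    d = {}

    for i in xrange(0, 1000000):
        d[(i, i, i)] = str(i) * 3

    dfdict = pd.DataFrame(d.values(),
                          index=pd.MultiIndex.from_tuples(d.keys(), names=['a', 'b', 'c']))

if __name__ == '__main__':
    testdict()
testlist.py is identical except that it appends to keyslist and valslist instead of filling d.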
Upvotes: 0
Reputation: 4085
I did memory profiling of 1M rows. The winning structure is array.array for each numerical index level plus a list for the strings (147MB for the data and 310MB peak for the conversion to pandas).
According to the Python manual,
Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained.
They even have an append method, and appending is most likely very fast.
Second place goes to two separate lists (308MB and 460MB). The other two options, a dict and a list of four-tuples, were the worst: dict 339MB / 524MB, list of four-tuples 308MB / 514MB.
Here is the use of array.array:
In [1]: from array import array
In [2]: import gc
In [3]: import pandas as pd
In [4]: %load_ext memory_profiler
In [5]: a1=array("l",range(1000000))
In [6]: a2=array("l",range(1000000))
In [7]: a3=array("l",range(1000000))
In [8]: b=[str(x*111) for x in list(range(1000000))]
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit a1,a2,a3,b
peak memory: 147.64 MiB, increment: 0.32 MiB
In [11]: %memit dfpair=pd.DataFrame(b, index=pd.MultiIndex.from_arrays([a1,a2,a3], names=['a','b','c']))
peak memory: 310.60 MiB, increment: 162.91 MiB
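Since the package reads the binary file record by record, the arrays can also be filled incrementally. This is only a hypothetical sketch (the record loop and add_record are illustrative stand-ins, not part of the question), but it shows the append-based usage:
from array import array
import pandas as pd

# one array per numeric index level, one plain list for the string values
a1, a2, a3 = array("l"), array("l"), array("l")
vals = []

def add_record(a, b, c, value):
    # array.append stores a machine-level long, not a boxed Python int
    a1.append(a)
    a2.append(b)
    a3.append(c)
    vals.append(value)

# stand-in for the binary reader
for i in range(1000000):
    add_record(i, i, i, str(i * 111))

dfpair = pd.DataFrame(vals, index=pd.MultiIndex.from_arrays([a1, a2, a3], names=['a', 'b', 'c']))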
Here is the rest of the code (very long):
List of tuples of four:
In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d2=[x+(b[i],) for i,x in enumerate(a)]
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d2
peak memory: 308.40 MiB, increment: 0.28 MiB
In [11]: %memit df = pd.DataFrame(d2, columns=['a','b','c','d']).set_index(['a','b','c'])
peak memory: 514.21 MiB, increment: 205.80 MiB
Dictionary:
In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d = dict(zip(a, b))
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d
peak memory: 339.14 MiB, increment: 0.23 MiB
In [11]: %memit dfdict=pd.DataFrame(list(d.values()), index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
peak memory: 524.10 MiB, increment: 184.95 MiB
Two separate lists:
In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: gc.collect()
Out[6]: 0
In [7]: %memit a,b
peak memory: 307.75 MiB, increment: 0.19 MiB
In [8]: %memit dfpair=pd.DataFrame(b, index=pd.MultiIndex.from_tuples(a, names=['a','b','c']))
peak memory: 459.94 MiB, increment: 152.19 MiB
Upvotes: 4