Devi Prasad Khatua

Reputation: 1235

Create a large DataFrame with limited resources (RAM)

I have a very large pandas.Series of shape (200000,) containing dicts:

In [11]: series.head()
Out[11]: 
train-1        {u'MI vs KKR': 7788, u'India vs Australia 2nd ...
train-10       {u'England Smarter with the Ball': 92, u'Dhoni...
train-100      {u'Star Sports 4': 13, u'Manchester United vs ...
train-1000     {u'SRH vs RCB': 701, u'KKR vs KXIP': 1042, u'M...
train-10000    {u'MI vs KKR': 304, u'Yeh Rishta Kya Kehlata H...
Name: titles, dtype: object

I want to create a DataFrame from this series, which could be done with:

df = pd.DataFrame(series.values.tolist(), index=series.index).fillna(0)

As the code above shows, I want to create a column for every unique key across all the dictionaries, fill in the numerical value wherever the key is present, and fill in zero wherever it is not, which is what fillna(0) does.

I can't show the result for my real dataset here, but in a nutshell, here is what I want to do using small dummy data:

small_series = pd.Series([{'a':1, 'b': 2}, {'b': 3, 'c': 4}])

small_series
Out[15]: 
0    {u'a': 1, u'b': 2}
1    {u'c': 4, u'b': 3}
dtype: object


pd.DataFrame(small_series.values.tolist()).fillna(0)
Out[17]: 
     a  b    c
0  1.0  2  0.0
1  0.0  3  4.0

This is straightforward, but the problem arises when the dictionaries are HUGE: with the technique above it eats up all my RAM (16 GB) and half of my swap (32 GB), and even then it never finishes!

I have searched around and people recommend sparse data structures, but as far as I can tell I would still need to build the dense frame first and only then convert it to sparse!

Please help me create this DataFrame within my 16 GB of memory!

Here's a ready-made template to work from (titles.pic is the Series pickled in Python 2.7):

import pickle
import pandas as pd

with open('titles.pic', 'rb') as f:
    series = pickle.load(f)


# print series

# This is where it takes up all the memory and runs forever!
df = pd.DataFrame(series.values.tolist(), index=series.index).fillna(0)

Any help/approach would be appreciated!

Upvotes: 0

Views: 77

Answers (1)

Ken Wei

Reputation: 3130

Your dataframe would have 200k * 10k = 2 billion elements, which is roughly 2 GB even if every element took only 1 byte; with pandas' default 8-byte float64 cells it is closer to 16 GB before any overhead. Clearly a dense representation won't work, so you need to use a SparseDataFrame:

pd.SparseDataFrame.from_records(small_series.values)
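Note that SparseDataFrame was removed in pandas 1.0. On a newer version, a rough sketch of the same idea, assuming scipy is available, is to build a scipy.sparse matrix directly from the dicts and wrap it with pd.DataFrame.sparse.from_spmatrix, so the dense table is never materialized (dicts_to_sparse below is just an illustrative helper, not a pandas function):

import pandas as pd
from scipy import sparse

def dicts_to_sparse(series):
    # Illustrative helper: build a CSR matrix from a Series of dicts
    # without ever materializing the dense table.
    columns = sorted({key for d in series for key in d})  # one column per unique key
    col_pos = {key: j for j, key in enumerate(columns)}

    # Collect (row, col, value) triplets; only the cells that actually exist are stored.
    rows, cols, vals = [], [], []
    for i, d in enumerate(series):
        for key, value in d.items():
            rows.append(i)
            cols.append(col_pos[key])
            vals.append(value)

    mat = sparse.coo_matrix((vals, (rows, cols)),
                            shape=(len(series), len(columns))).tocsr()
    return mat, columns

mat, columns = dicts_to_sparse(series)
# Missing keys stay implicit zeros, so no fillna(0) pass is needed.
df = pd.DataFrame.sparse.from_spmatrix(mat, index=series.index, columns=columns)

Memory then scales with the number of non-zero entries rather than with rows * columns.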

Upvotes: 1
