Reputation: 1235
I have a very large pandas.Series of shape (200000,) containing dicts:
In[11]: series.head()
Out[12]:
train-1 {u'MI vs KKR': 7788, u'India vs Australia 2nd ...
train-10 {u'England Smarter with the Ball': 92, u'Dhoni...
train-100 {u'Star Sports 4': 13, u'Manchester United vs ...
train-1000 {u'SRH vs RCB': 701, u'KKR vs KXIP': 1042, u'M...
train-10000 {u'MI vs KKR': 304, u'Yeh Rishta Kya Kehlata H...
Name: titles, dtype: object
I want to create a DataFrame from the series, which can be done with:
df = pd.DataFrame(series.values.tolist(), index=series.index).fillna(0)
As the code above shows, I want a column for every unique key across all the dictionaries, filled with the numerical value where the key is present in that row's dictionary and with zero otherwise - that last part is what fillna(0) does.
I can't show what I want for my full dataset, but in a nutshell, below is the code for what I want to do, using small dummy data:
small_series = pd.Series([{'a':1, 'b': 2}, {'b': 3, 'c': 4}])
small_series
Out[15]:
0 {u'a': 1, u'b': 2}
1 {u'c': 4, u'b': 3}
dtype: object
pd.DataFrame(small_series.values.tolist()).fillna(0)
Out[17]:
a b c
0 1.0 2 0.0
1 0.0 3 4.0
Well, this is straightforward, but the problem arises when the dictionaries are HUGE: with the above technique the conversion eats up all my RAM (16 GB) and half of my swap (32 GB), and even then it never finishes!
I have searched around, and people recommend using sparse data structures, but with this approach I would have to create the dense one first and only then convert it to sparse!
Please help me create the DataFrame within my limited 16 GB of memory!
Here's a ready-made template that reproduces the problem (the titles.pic file is pickled with Python 2.7):
import pickle
import pandas as pd
series = pickle.load(open('titles.pic', 'rb'))
# print series
# This is where it takes up all the memory and runs forever!
df = pd.DataFrame(series.values.tolist(), index=series.index).fillna(0)
Any help/approach would be appreciated!
Upvotes: 0
Views: 77
Reputation: 3130
Your dataframe would have roughly 200k * 10k = 2 billion elements, which translates to about 2 GB even if every element took only 1 byte - and around 16 GB with the 8-byte float64 values the construction actually produces. Clearly a dense representation won't work, so you need to use a SparseDataFrame:
pd.SparseDataFrame.from_records(small_series.values)
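If from_records still blows up the memory at your scale, here is a rough sketch (untested on your data, and assuming your Series is called series as in the question) that builds a scipy.sparse COO matrix directly from the dicts and only then wraps it in a sparse frame, so the zeros are never stored:
import pandas as pd
from scipy import sparse

# one fixed column position per unique key across all the dicts
columns = sorted({key for d in series.values for key in d})
col_pos = {key: j for j, key in enumerate(columns)}

# build (row, col, value) triplets without ever creating a dense array
rows, cols, vals = [], [], []
for i, d in enumerate(series.values):
    for key, value in d.items():
        rows.append(i)
        cols.append(col_pos[key])
        vals.append(value)

mat = sparse.coo_matrix((vals, (rows, cols)),
                        shape=(len(series), len(columns)))

# pandas < 1.0: wrap the sparse matrix; missing entries stay implicit zeros
df = pd.SparseDataFrame(mat, index=series.index, columns=columns,
                        default_fill_value=0)
On pandas 0.25+ you can wrap the same matrix with pd.DataFrame.sparse.from_spmatrix(mat, index=series.index, columns=columns) instead; SparseDataFrame itself was deprecated in 0.25 and removed in 1.0.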
Upvotes: 1