quant
quant

Reputation: 23152

Efficiently converting a Series of dictionaries to a DataFrame

I have a large (ish) Series of dictionaries that I'd like "flatten". In order to test / reproduce my problem I have created a Series with a similar structure:

>>> my_series = pd.Series([{'A': [1], 'B' : []}, {'A' : [1, 2], 'B' : [3, 4]}])
>>> my_series
0           {u'A': [1], u'B': []}
1    {u'A': [1, 2], u'B': [3, 4]}
dtype: object

The next step is converting it to a 'DataFrame' with a hierarchical index. I have found a way that works functionally:

>>> pd.DataFrame(pd.DataFrame.from_dict(row, orient='index').stack() for row in my_series)
     A         B     
     0    1    0    1
0  1.0  NaN  NaN  NaN
1  1.0  2.0  3.0  4.0

This gives me what I want, but on my actual dataset it is prohibitively slow, taking around 60 seconds for 30,000 rows with 0-4 elements in each of the lists, and using around 8G of RAM.

I have tried using a parallel map operation with the multiprocessing module to speed things up, but thought I'd ask here to see if there was a better way.

Can I achieve the same outcome as above in a more reasonable time?

Upvotes: 2

Views: 582

Answers (3)

Allen Qin
Allen Qin

Reputation: 19957

Setup

my_series = pd.Series([{'A': [1], 'B' : []}, {'A' : [1, 2], 'B' : [3, 4]}])
df = pd.DataFrame.from_dict(s.tolist())  

Solution

Ok, got a faster solution working.

idx = pd.MultiIndex.from_product([['A','B'],[0,1]])

pd.DataFrame(pd.DataFrame(df.values.flatten().tolist()).values.reshape(2,-1), columns=idx)

Out[1051]: 
     A         B     
     0    1    0    1
0  1.0  NaN  NaN  NaN
1  1.0  2.0  3.0  4.0

Old Solution

#Convert list elements to columns
df_A = df.A.apply(pd.Series).stack().unstack()
df_B = df.B.apply(pd.Series).stack().unstack()
#rename columns
df_A.columns = ['A_' + str(e) for e in df_A.columns]
df_B.columns = ['B_' + str(e) for e in df_B.columns]
#combine two dataframes
pd.concat([df_A,df_B],axis=1)

Out[973]: 
   A_0  A_1  B_0  B_1
0  1.0  NaN  NaN  NaN
1  1.0  2.0  3.0  4.0

Test

%timeit pd.DataFrame(pd.DataFrame(df.values.flatten().tolist()).values.reshape(2,-1), columns=idx)
1000 loops, best of 3: 378 µs per loop

%timeit pd.concat([pd.DataFrame(df[x].values.tolist()) for x in df.columns], axis=1,keys=df.columns)
1000 loops, best of 3: 1.22 ms per loop

Upvotes: 0

jezrael
jezrael

Reputation: 863631

You can use DataFrame constructor, but first convert Series to numpy array by values and then to list:

a = pd.DataFrame(my_series.values.tolist())
print (a)
        A       B
0     [1]      []
1  [1, 2]  [3, 4]

Then for flatennig is possible use concat with list comprehension:

b = pd.concat([pd.DataFrame(a[x].values.tolist()) for x in a.columns], axis=1,keys=a.columns)
print (b)
   A         B     
   0    1    0    1
0  1  NaN  NaN  NaN
1  1  2.0  3.0  4.0

If convert to numpy array is is faster:

In [93]: %timeit pd.DataFrame(list(my_series))
1000 loops, best of 3: 550 µs per loop

In [94]: %timeit pd.DataFrame(my_series.values.tolist())
1000 loops, best of 3: 516 µs per loop

Upvotes: 1

Kasravnd
Kasravnd

Reputation: 107347

First off, since you have your dictionaries in a pandas-based data structure you might be able to create a DataFrame instead of a series.

Secondly DataFrame can accept a list of dictionaries and construct the expected result for you. So, if you don't have the control over the construction of the series in the first place, you can just convert the series to list and pass it to DataFrame:

In [10]: pd.DataFrame(list(my_series))
Out[10]: 
        A       B
0     [1]      []
1  [1, 2]  [3, 4]

Upvotes: 1

Related Questions