Reputation: 23152
I have a large(ish) Series of dictionaries that I'd like to flatten. To test / reproduce my problem I have created a Series with a similar structure:
>>> my_series = pd.Series([{'A': [1], 'B' : []}, {'A' : [1, 2], 'B' : [3, 4]}])
>>> my_series
0 {u'A': [1], u'B': []}
1 {u'A': [1, 2], u'B': [3, 4]}
dtype: object
The next step is converting it to a DataFrame with a hierarchical index. I have found a way that is functionally correct:
>>> pd.DataFrame(pd.DataFrame.from_dict(row, orient='index').stack() for row in my_series)
A B
0 1 0 1
0 1.0 NaN NaN NaN
1 1.0 2.0 3.0 4.0
This gives me what I want, but on my actual dataset it is prohibitively slow, taking around 60 seconds for 30,000 rows with 0-4 elements in each of the lists, and using around 8 GB of RAM.
I have tried using a parallel map operation with the multiprocessing module to speed things up, but thought I'd ask here in case there is a better way.
Can I achieve the same outcome as above in a more reasonable time?
Upvotes: 2
Views: 582
Reputation: 19957
Setup
my_series = pd.Series([{'A': [1], 'B': []}, {'A': [1, 2], 'B': [3, 4]}])
df = pd.DataFrame.from_dict(my_series.tolist())
Solution
Ok, got a faster solution working.
idx = pd.MultiIndex.from_product([['A', 'B'], [0, 1]])
# Note: the 2 in reshape(2, -1) is the number of rows in df.
pd.DataFrame(pd.DataFrame(df.values.flatten().tolist()).values.reshape(2, -1), columns=idx)
Out[1051]:
A B
0 1 0 1
0 1.0 NaN NaN NaN
1 1.0 2.0 3.0 4.0
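The snippet above hard-codes the row count (the 2 in `reshape(2, -1)`) and the MultiIndex levels. A minimal sketch of a generalized version that derives both from the data, assuming every dict has the same set of keys:

```python
import pandas as pd

my_series = pd.Series([{'A': [1], 'B': []}, {'A': [1, 2], 'B': [3, 4]}])
df = pd.DataFrame.from_dict(my_series.tolist())

# One row per cell, NaN-padded to the length of the longest list.
flat = pd.DataFrame(df.values.flatten().tolist())

# Derive the column MultiIndex and the reshape from the actual
# shapes instead of hard-coding them.
idx = pd.MultiIndex.from_product([df.columns, range(flat.shape[1])])
result = pd.DataFrame(flat.values.reshape(len(df), -1), columns=idx)
```

This keeps the same flatten-then-reshape idea but survives a change in the number of rows or in the longest list length.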
Old Solution
# Convert list elements to columns
df_A = df.A.apply(pd.Series).stack().unstack()
df_B = df.B.apply(pd.Series).stack().unstack()
# Rename columns
df_A.columns = ['A_' + str(e) for e in df_A.columns]
df_B.columns = ['B_' + str(e) for e in df_B.columns]
# Combine the two DataFrames
pd.concat([df_A, df_B], axis=1)
Out[973]:
A_0 A_1 B_0 B_1
0 1.0 NaN NaN NaN
1 1.0 2.0 3.0 4.0
Test
%timeit pd.DataFrame(pd.DataFrame(df.values.flatten().tolist()).values.reshape(2,-1), columns=idx)
1000 loops, best of 3: 378 µs per loop
%timeit pd.concat([pd.DataFrame(df[x].values.tolist()) for x in df.columns], axis=1,keys=df.columns)
1000 loops, best of 3: 1.22 ms per loop
Upvotes: 0
Reputation: 863631
You can use the DataFrame constructor, but first convert the Series to a numpy array with values and then to a list:
a = pd.DataFrame(my_series.values.tolist())
print (a)
A B
0 [1] []
1 [1, 2] [3, 4]
Then for the flattening you can use concat with a list comprehension:
b = pd.concat([pd.DataFrame(a[x].values.tolist()) for x in a.columns], axis=1,keys=a.columns)
print (b)
A B
0 1 0 1
0 1 NaN NaN NaN
1 1 2.0 3.0 4.0
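Because keys=a.columns puts the original dict keys on the top level of a column MultiIndex, the flattened frame can then be sliced per key. A small usage sketch:

```python
import pandas as pd

my_series = pd.Series([{'A': [1], 'B': []}, {'A': [1, 2], 'B': [3, 4]}])
a = pd.DataFrame(my_series.values.tolist())
b = pd.concat([pd.DataFrame(a[x].values.tolist()) for x in a.columns],
              axis=1, keys=a.columns)

# The top level of the column MultiIndex is the original dict key ...
sub = b['B']          # all flattened positions for key 'B'
# ... and a (key, position) tuple selects a single column.
col = b[('A', 0)]
```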
Converting to a numpy array first is slightly faster:
In [93]: %timeit pd.DataFrame(list(my_series))
1000 loops, best of 3: 550 µs per loop
In [94]: %timeit pd.DataFrame(my_series.values.tolist())
1000 loops, best of 3: 516 µs per loop
Upvotes: 1
Reputation: 107347
First off, since you already have your dictionaries in a pandas-based data structure, you might be able to create a DataFrame instead of a Series. Secondly, the DataFrame constructor can accept a list of dictionaries and build the expected result for you. So if you don't have control over the construction of the Series in the first place, you can just convert it to a list and pass it to DataFrame:
In [10]: pd.DataFrame(list(my_series))
Out[10]:
A B
0 [1] []
1 [1, 2] [3, 4]
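Following the first suggestion, a minimal sketch of building the DataFrame directly from the raw list of dicts and skipping the intermediate Series entirely (the raw variable is illustrative, standing in for wherever the dicts originate):

```python
import pandas as pd

# Hypothetical raw input, before it is ever wrapped in a Series.
raw = [{'A': [1], 'B': []}, {'A': [1, 2], 'B': [3, 4]}]

# The dict keys become columns; each cell still holds the original list.
df = pd.DataFrame(raw)
```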
Upvotes: 1