gowithefloww

Reputation: 2251

Filling a dataframe with high speed

I have a pandas Series where each cell holds a list of dictionaries:

data.individus.head(5)
Out[25]: 
0    [{'dateDeNaissance': 1954-09-14 00:00:00, 'enc...
1    [{'dateDeNaissance': 1984-09-14 00:00:00, 'enc...
2    [{'enceinte': False, 'dateDeNaissance': 1981-0...
3    [{'dateDeNaissance': 1989-09-14 00:00:00, 'enc...
4    [{'enceinte': False, 'dateDeNaissance': 1989-0...
Name: individus, dtype: object

I would like to construct a pandas DataFrame from those dictionaries, like so:

t_individus.ix[:, ['dateDeNaissance', 'enceinte']].head()
Out[14]: 
       dateDeNaissance enceinte
0  1954-09-14 00:00:00    False
1  1984-09-14 00:00:00    False
2  1981-09-14 00:00:00    False
3  1989-09-14 00:00:00    False
4  1989-09-14 00:00:00    False

Note that I have many more keys (~50), but I'm only showing two for the example.

I tried two different approaches, but I'm not satisfied with the processing speed of either:

1/ Concatenating

serie = data.foo  # 110199 lines
keys = get_all_possible_keys(serie)  # 48 keys (process time: 0.8s)
table = pd.DataFrame(columns=list(keys))

for i in serie:
    # turn the dict into a one-row frame: keys as columns, values as the row
    df = pd.DataFrame(list(i.items()))
    df = df.transpose()
    df.columns = df.iloc[0]            # first row holds the keys
    df = df.reindex(df.index.drop(0))  # keep only the values row
    table = pd.concat([table, df], axis=0)  # recopies `table` on every pass

It starts fast and then slows down steadily as table gets bigger. The overall process takes around an hour.
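For reference, the usual fix for this pattern (not in the original post; a sketch reusing serie from above and assuming each element is a dict, as method 1 does) is to accumulate the per-record frames in a plain Python list and concatenate once at the end, since each pd.concat inside the loop recopies the whole table:

import pandas as pd

# collect one-row frames in a list; a single concat at the end is linear
# in the total size, unlike repeated concatenation inside the loop
frames = [pd.DataFrame(list(i.items())).set_index(0).transpose()
          for i in serie]
table = pd.concat(frames, axis=0, ignore_index=True)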

2/ Pre-allocating memory and filling row by row

serie = data.foo
keys = get_all_possible_keys(serie)
len_serie = len(serie)
# -- Pre-allocate memory by declaring size
table = pd.DataFrame(np.nan,
                     index=range(0, len_serie),
                     columns=list(keys))
# -- Fill row by row
k = 0
for i in serie:
    table.loc[k] = pd.Series(i[0])  # i[0] is the dict inside the one-element list
    k += 1

Processing time depends on the table's size: it's much faster on a subset (~10k rows) and gets dramatically slower on the full dataset (110k rows).
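One way to check whether the slowdown is super-linear (a toy sketch with made-up columns, not the real data) is to time the same row-wise assignment loop at two sizes:

import numpy as np
import pandas as pd
from timeit import default_timer as timer

# time row-wise .loc assignment into a pre-allocated frame at two sizes
for n in (10_000, 50_000):
    table = pd.DataFrame(np.nan, index=range(n), columns=list('abc'))
    row = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
    start = timer()
    for k in range(n):
        table.loc[k] = row
    print(n, round(timer() - start, 1))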

Two questions:

  1. Why does method 2 get so slow when the table is big (much slower than method 1), when it's only filling pre-allocated empty rows?
  2. Any ideas on how I could speed up the process?

Upvotes: 0

Views: 452

Answers (2)

MaxU - stand with Ukraine

Reputation: 210942

It's almost the same idea as in @James's answer, but in your case you have a Series of lists of dicts, which you first want to convert to a flat list of dicts (or a Series of dicts):

In [12]: s
Out[12]:
0    [{'a': 'aaa', 'b': 'bbb', 'c': 'ccc'}]
1       [{'a': 'a1', 'b': 'b1', 'c': 'c1'}]
dtype: object

In [13]: pd.DataFrame(s.sum())
Out[13]:
     a    b    c
0  aaa  bbb  ccc
1   a1   b1   c1

In [14]: s.sum()
Out[14]: [{'a': 'aaa', 'b': 'bbb', 'c': 'ccc'}, {'a': 'a1', 'b': 'b1', 'c': 'c1'}]
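
As an aside (not part of the original answer): s.sum() concatenates the underlying Python lists pairwise, which can itself become quadratic on a long Series; a linear way to flatten is itertools.chain:

from itertools import chain
import pandas as pd

# flatten the Series of lists into one list of dicts in a single pass,
# then let the DataFrame constructor align the keys into columns
flat = list(chain.from_iterable(s))
df = pd.DataFrame(flat)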

Note that using .tolist() instead does not unwrap the inner lists, so you get a single column of dicts:

In [15]: pd.DataFrame(s.tolist())
Out[15]:
                                      0
0  {'a': 'aaa', 'b': 'bbb', 'c': 'ccc'}
1     {'a': 'a1', 'b': 'b1', 'c': 'c1'}

Upvotes: 2

James Elderfield

Reputation: 2507

I have found in the past that it is surprisingly quick to build a DataFrame from a list of dicts. My simple suggestion would be to try:

dataframe = pandas.DataFrame(data.foo.tolist())
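
Note that with the question's data each cell is a list wrapping the dict, so the inner list needs unwrapping first; a sketch assuming one dict per cell:

import pandas

# data.foo holds one-element lists; pull the dict out of each cell
# before handing the whole thing to the DataFrame constructor
dataframe = pandas.DataFrame([row[0] for row in data.foo])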

Upvotes: 3
