Outcast

Reputation: 5117

Loop gets slower after each iteration

I have a Python script which does the following:

  1. I have a list of JSON documents
  2. I create an empty pandas DataFrame
  3. I run a for loop over this list
  4. At every iteration I create an empty dictionary with the (same) keys that are interesting for me
  5. At every iteration I parse the JSON to retrieve the values of those keys
  6. At every iteration I append the dictionary to the pandas DataFrame

The issue with this is that the processing time increases with every iteration. Specifically:

0-1000 documents -> 5 seconds
1000-2000 documents -> 6 seconds
2000-3000 documents -> 7 seconds
...
10000-11000 documents -> 18 seconds
11000-12000 documents -> 19 seconds
...
22000-23000 documents -> 39 seconds
23000-24000 documents -> 42 seconds
...
34000-35000 documents -> 69 seconds
35000-36000 documents -> 72 seconds

Why is this happening?

My code looks like this:

import pandas as pd

# 'documents' is the list of JSON documents

columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']

df_documents = pd.DataFrame(columns=columns)

for index, document in enumerate(documents):

    dict_document = dict.fromkeys(columns)

    ...
    # (parse the JSON, retrieve the values of the keys and assign them to the dictionary)
    ...

    # rebinds df_documents to a new, larger copy on every iteration
    df_documents = df_documents.append(dict_document, ignore_index=True)
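
For reference, a self-contained toy version of the loop that reproduces the slowdown could look like the sketch below; the parsing step is replaced with a stub and the column list is shortened, so the numbers won't match mine exactly. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this only runs on older pandas versions.)

import time

import pandas as pd

columns = ['column_1', 'column_2']  # shortened column list for the demo

# stub documents standing in for the real JSONs
documents = [{'column_1': i, 'column_2': 2 * i} for i in range(36000)]

df_documents = pd.DataFrame(columns=columns)
start = time.time()

for index, document in enumerate(documents):
    dict_document = dict.fromkeys(columns)
    dict_document.update(document)  # stub for the real parsing step

    # copies the entire (growing) DataFrame on every iteration
    df_documents = df_documents.append(dict_document, ignore_index=True)

    if (index + 1) % 1000 == 0:
        print(f'{index + 1 - 1000}-{index + 1} documents -> '
              f'{time.time() - start:.2f} seconds')
        start = time.time()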

P.S.

After applying @eumiro's suggestion below, the times are the following:

0-1000 documents -> 0.06 seconds
1000-2000 documents -> 0.05 seconds
2000-3000 documents -> 0.05 seconds
...
10000-11000 documents -> 0.05 seconds
11000-12000 documents -> 0.05 seconds
...
22000-23000 documents -> 0.05 seconds
23000-24000 documents -> 0.05 seconds
...
34000-35000 documents -> 0.05 seconds
35000-36000 documents -> 0.05 seconds

After applying @DariuszKrynicki's suggestion below, the times are the following:

0-1000 documents -> 0.56 seconds
1000-2000 documents -> 0.54 seconds
2000-3000 documents -> 0.53 seconds
...
10000-11000 documents -> 0.51 seconds
11000-12000 documents -> 0.51 seconds
...
22000-23000 documents -> 0.51 seconds
23000-24000 documents -> 0.51 seconds
...
34000-35000 documents -> 0.51 seconds
35000-36000 documents -> 0.51 seconds
...

Upvotes: 1

Views: 2085

Answers (3)

Dariusz Krynicki

Reputation: 2718

I suspect your DataFrame is growing with each iteration. How about using iterators?

import pandas as pd

# documents = # list of JSON documents
def get_df_from_json(document):
    columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
    # parse the JSON, retrieve the values of the keys and build the dictionary
    # dict_document =  # use document to parse it and create the dictionary
    return pd.DataFrame(list(dict_document.values()), index=dict_document)

# generator expression: one small DataFrame per document, concatenated once
res = (get_df_from_json(document) for document in documents)
res = pd.concat(res).reset_index()

EDIT: I have made a quick comparison on the example below, and it turns out that using an iterator does not speed up the code compared to a list comprehension:

import json
import time

import pandas as pd


def get_df_from_json():
    dd = {'a': [1, 1], 'b': [2, 2]}
    app_json = json.dumps(dd)  # kept from the original test; not used further
    return pd.DataFrame(list(dd.values()), index=dd)

# generator expression
start = time.time()
res = pd.concat((get_df_from_json() for x in range(1, 20000))).reset_index()
print(time.time() - start)

# list comprehension
start = time.time()
res = pd.concat([get_df_from_json() for x in range(1, 20000)]).reset_index()
print(time.time() - start)

iterator: 9.425999879837036
list comprehension: 8.934999942779541

Upvotes: 1

eumiro

Reputation: 212825

Yes, appending to a DataFrame gets slower with each new row, because the whole (growing) contents have to be copied again and again, so the total work grows quadratically with the number of rows.

Create a simple list, append to it and then create one DataFrame in one step:

records = []

for index, document in enumerate(documents):
    …
    records.append(dict_document)

df_documents = pd.DataFrame.from_records(records)

Upvotes: 8

FlyingTeller

Reputation: 20472

The answer could well lie in the pandas.DataFrame.append method that you are calling in every iteration. It is very inefficient: it has to allocate new memory and copy the old data on every call, which would explain your results. See also the official pandas.DataFrame.append docs:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

with the two examples:

Less efficient:

>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4

You can apply the same strategy: create a list of DataFrames instead of appending to the same DataFrame on each iteration, then concatenate them once your for loop is finished, as in the sketch below.
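
Applied to the code from the question, that could look roughly like the following sketch (parse_document is a hypothetical stand-in for the elided parsing step, not a function from the question):

import pandas as pd

dfs = []  # one small one-row DataFrame per document

for document in documents:
    dict_document = parse_document(document)  # hypothetical parsing step
    dfs.append(pd.DataFrame([dict_document]))

# a single concatenation at the end instead of one full copy per iteration
df_documents = pd.concat(dfs, ignore_index=True)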

Upvotes: 3
