Reputation: 5117
I have a Python script which does the following: it iterates over a list of JSON documents, retrieves the values of some keys from each one, and appends them as a new row to a pandas dataframe. The issue is that at every iteration the processing time increases. Specifically:
0-1000 documents -> 5 seconds
1000-2000 documents -> 6 seconds
2000-3000 documents -> 7 seconds
...
10000-11000 documents -> 18 seconds
11000-12000 documents -> 19 seconds
...
22000-23000 documents -> 39 seconds
23000-24000 documents -> 42 seconds
...
34000-35000 documents -> 69 seconds
35000-36000 documents -> 72 seconds
Why is this happening?
My code looks like this:
# 'documents' is the list of jsons
columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
df_documents = pd.DataFrame(columns=columns)
for index, document in enumerate(documents):
    dict_document = dict.fromkeys(columns)
    ...
    # parse the json, retrieve the values of the keys and assign them to the dictionary
    ...
    df_documents = df_documents.append(dict_document, ignore_index=True)
P.S.
After applying @eumiro's suggestion below the times are the following:
0-1000 documents -> 0.06 seconds
1000-2000 documents -> 0.05 seconds
2000-3000 documents -> 0.05 seconds
...
10000-11000 documents -> 0.05 seconds
11000-12000 documents -> 0.05 seconds
...
22000-23000 documents -> 0.05 seconds
23000-24000 documents -> 0.05 seconds
...
34000-35000 documents -> 0.05 seconds
35000-36000 documents -> 0.05 seconds
After applying @DariuszKrynicki's suggestion below the times are the following:
0-1000 documents -> 0.56 seconds
1000-2000 documents -> 0.54 seconds
2000-3000 documents -> 0.53 seconds
...
10000-11000 documents -> 0.51 seconds
11000-12000 documents -> 0.51 seconds
...
22000-23000 documents -> 0.51 seconds
23000-24000 documents -> 0.51 seconds
...
34000-35000 documents -> 0.51 seconds
35000-36000 documents -> 0.51 seconds
...
Upvotes: 1
Views: 2085
Reputation: 2718
I suspect your DataFrame is growing with each iteration. How about using iterators?
# documents = # json
def get_df_from_json(document):
    columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
    # parse the json: retrieve the values of the keys and assign them to the dictionary
    # dict_document = # use document to parse it and create the dictionary
    return pd.DataFrame(list(dict_document.values()), index=dict_document)

res = (get_df_from_json(document) for document in documents)
res = pd.concat(res).reset_index()
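For illustration, here is a minimal, self-contained sketch of this generator-plus-concat idea; the sample documents and the parse_document helper are placeholders rather than the question's actual parsing code, and it builds one single-row frame per document:

import pandas as pd

# stand-in data; the real script would extract the 20 columns from each json
documents = [{'column_1': 1, 'column_2': 'a'}, {'column_1': 2, 'column_2': 'b'}]

def parse_document(document):
    # hypothetical stand-in for the elided key extraction
    return {'column_1': document.get('column_1'), 'column_2': document.get('column_2')}

def get_df_from_json(document):
    # one single-row frame per document
    return pd.DataFrame([parse_document(document)])

# the generator yields one small frame per document; concat builds the result once
res = pd.concat(get_df_from_json(document) for document in documents)
res = res.reset_index(drop=True)
print(res)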
EDIT: I made a quick comparison on the example below, and it turns out that using an iterator does not speed up the code compared to using a list comprehension:
import json
import time

import pandas as pd

def get_df_from_json():
    dd = {'a': [1, 1], 'b': [2, 2]}
    app_json = json.dumps(dd)
    return pd.DataFrame(list(dd.values()), index=dd)

start = time.time()
res = pd.concat((get_df_from_json() for x in range(1, 20000))).reset_index()
print(time.time() - start)

start = time.time()
res = pd.concat([get_df_from_json() for x in range(1, 20000)]).reset_index()
print(time.time() - start)
iterator: 9.425999879837036
list comprehension: 8.934999942779541
Upvotes: 1
Reputation: 212825
Yes, appending to a DataFrame will be slower with each new row, because it has to copy the whole (growing) contents again and again.
Create a simple list, append to it and then create one DataFrame in one step:
records = []
for index, document in enumerate(documents):
    …
    records.append(dict_document)

df_documents = pd.DataFrame.from_records(records)
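A minimal end-to-end sketch of this pattern, using stand-in sample documents and a trivial parsing step in place of the question's elided JSON handling:

import pandas as pd

# stand-in for the question's list of parsed JSON documents
documents = [{'column_1': 1, 'column_2': 'a'}, {'column_1': 2, 'column_2': 'b'}]
columns = ['column_1', 'column_2']

records = []
for document in documents:
    # hypothetical parsing step - the real script extracts 20 keys here
    dict_document = {column: document.get(column) for column in columns}
    records.append(dict_document)  # appending to a Python list is cheap

# the DataFrame is built once, after the loop
df_documents = pd.DataFrame.from_records(records, columns=columns)
print(df_documents)

Because appending to the list does not copy any existing data, the per-chunk times stay flat, which matches the constant 0.05-second timings reported in the question's P.S.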
Upvotes: 8
Reputation: 20472
The answer could already lie in the pandas.DataFrame.append method which you are constantly calling. It is very inefficient, because it needs to allocate new memory frequently, i.e. copy the old data, which could explain your results. See also the official pandas.DataFrame.append docs:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
with the two examples:
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
You can apply the same strategy: create a list of dataframes instead of appending to the same dataframe in each iteration, then concat them once your for loop is finished.
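As a rough sketch of that strategy, with placeholder data and a hypothetical parse_document helper standing in for the question's actual parsing code:

import pandas as pd

# placeholder input; the real script iterates over the parsed JSON documents
documents = [{'column_1': 1, 'column_2': 'a'}, {'column_1': 2, 'column_2': 'b'}]

def parse_document(document):
    # hypothetical stand-in for the elided key extraction
    return {'column_1': document.get('column_1'), 'column_2': document.get('column_2')}

frames = []
for document in documents:
    frames.append(pd.DataFrame([parse_document(document)]))  # one single-row frame per iteration

# a single concat after the loop avoids repeatedly copying a growing DataFrame
df_documents = pd.concat(frames, ignore_index=True)
print(df_documents)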
Upvotes: 3