Chris Dixon

Reputation: 1046

How to concat thousands of pandas dataframes generated by a for loop efficiently?

Thousands of dfs of consistent columns are being generated in a for loop reading different files, and I'm trying to merge / concat / append them into a single df, combined:

combined = pd.DataFrame()

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here from one file
    combined = pd.concat([combined, df])

This is initially fast but slows down as combined grows, eventually becoming unusably slow, since each concat copies all of the rows accumulated so far. This answer on how to append rows explains that collecting rows in a dict and then creating the df once at the end is most efficient, but I can't figure out how to do that with to_dict.

What's a good way to do this? Am I approaching this the wrong way?

Upvotes: 6

Views: 11158

Answers (3)

ChaimG

Reputation: 7522

  • Use concat only once at the end.
  • Sort the index of each DataFrame. In my production code this sort didn't take long, yet it reduced the processing time of concat from 10+ seconds to less than one second!

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    df.sort_index(inplace=True)
    dfs.append(df)

combined = pd.concat(dfs)
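
For a quick self-contained check of this pattern, a stand-in generate_df can be substituted (illustrative only; the question's real function reads files):

import numpy as np
import pandas as pd

def generate_df():
    # stand-in for the question's file-reading step (illustrative only)
    return pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))

dfs = []
for i in range(1, 1000):  # demo only
    df = generate_df()
    df.sort_index(inplace=True)
    dfs.append(df)

combined = pd.concat(dfs)
print(combined.shape)  # (4995, 3)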

Upvotes: 2

Francesco Pasa

Reputation: 594

The fastest way is to build a list of dictionaries and construct the DataFrame only once at the end:

rows = []

for i in range(1, 1000):
    # Instead of generating a dataframe, generate a dictionary
    dictionary = generate_dictionary()
    rows.append(dictionary)

combined = pd.DataFrame(rows)

This is about 100 times faster than concatenating dataframes, as shown by the benchmark here.
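
If the loop already produces DataFrames, as in the question, one way to adapt this (a sketch, assuming each generate_df() call returns a DataFrame, as the question's placeholder suggests) is to flatten every frame into row dicts with to_dict('records') and still build the final DataFrame only once:

rows = []

for i in range(1, 1000):  # demo only
    df = generate_df()                  # the question's placeholder
    rows.extend(df.to_dict("records"))  # one dict per row

combined = pd.DataFrame(rows)

Note that column dtypes are re-inferred when the final DataFrame is built from the dicts.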

Upvotes: 12

jezrael

Reputation: 863351

You can create a list of DataFrames and then use concat only once:

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    dfs.append(df)

combined = pd.concat(dfs)
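
If the generated frames all keep their default RangeIndex, the combined result will contain repeated index labels; passing ignore_index=True (a standard pd.concat option) renumbers the rows instead:

combined = pd.concat(dfs, ignore_index=True)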

Upvotes: 7
