Chris Dixon

Reputation: 1046

How to concat thousands of pandas dataframes generated by a for loop efficiently?

Thousands of dfs of consistent columns are being generated in a for loop reading different files, and I'm trying to merge / concat / append them into a single df, combined:

combined = pd.DataFrame()

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here from one file
    combined = pd.concat([combined, df])

This is initially fast but slows down as combined grows, eventually becoming unusably slow, since each concat copies all of the rows accumulated so far. This answer on how to append rows explains that collecting rows in a dict and then creating the df once at the end is most efficient, but I can't figure out how to do that with to_dict.

What's a good way to do this? Am I approaching this the wrong way?

Upvotes: 6

Views: 11158

Answers (3)

ChaimG

Reputation: 7522

  • Use concat only once at the end.
  • Sort the index of each DataFrame. In my production code this sort didn't take long, yet it reduced the processing time of concat from 10+ seconds to less than one second!

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    df.sort_index(inplace=True)
    dfs.append(df)

combined = pd.concat(dfs)
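
For a quick self-contained check of this pattern, a stand-in generate_df can be substituted (illustrative only; the question's real function reads files):

import numpy as np
import pandas as pd

def generate_df():
    # stand-in for the question's file-reading step (illustrative only)
    return pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))

dfs = []
for i in range(1, 1000):  # demo only
    df = generate_df()
    df.sort_index(inplace=True)
    dfs.append(df)

combined = pd.concat(dfs)
print(combined.shape)  # (4995, 3)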

Upvotes: 2

Francesco Pasa

Reputation: 594

The fastest way is to build a list of dictionaries and construct the DataFrame only once at the end:

rows = []

for i in range(1, 1000):
    # Instead of generating a dataframe, generate a dictionary
    dictionary = generate_dictionary()
    rows.append(dictionary)

combined = pd.DataFrame(rows)

This is about 100 times faster than concatenating dataframes, as shown by the benchmark here.
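
If the loop already produces DataFrames, as in the question, one way to adapt this (a sketch, assuming each generate_df() call returns a DataFrame, as the question's placeholder suggests) is to flatten every frame into row dicts with to_dict('records') and still build the final DataFrame only once:

rows = []

for i in range(1, 1000):  # demo only
    df = generate_df()                  # the question's placeholder
    rows.extend(df.to_dict("records"))  # one dict per row

combined = pd.DataFrame(rows)

Note that column dtypes are re-inferred when the final DataFrame is built from the dicts.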

Upvotes: 12

jezrael

Reputation: 863351

You can create a list of DataFrames and then use concat only once:

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    dfs.append(df)

combined = pd.concat(dfs)
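
If the generated frames all keep their default RangeIndex, the combined result will contain repeated index labels; passing ignore_index=True (a standard pd.concat option) renumbers the rows instead:

combined = pd.concat(dfs, ignore_index=True)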

Upvotes: 7
