Reputation: 1046
Thousands of DataFrames with consistent columns are being generated in a for loop, one per input file, and I'm trying to merge / concat / append them into a single DataFrame, combined:
import pandas as pd

combined = pd.DataFrame()
for i in range(1, 1000):  # demo only
    df = generate_df()    # df for the current file is created here
    combined = pd.concat([combined, df])
This is initially fast but slows down as combined grows, eventually becoming unusably slow. This answer on how to append rows explains that adding rows to a dict and then creating a DataFrame at the end is the most efficient approach, but I can't figure out how to do that with to_dict.
What's a good way to do this? Am I approaching this the wrong way?
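For reference, this is roughly what I imagine the dict-based version would look like using to_dict (generate_df stands in for my real file-reading code), though I'm not sure this is what that answer intends:

import pandas as pd

rows = []
for i in range(1, 1000):  # demo only
    df = generate_df()                   # per-file DataFrame, as before
    rows.extend(df.to_dict("records"))   # accumulate plain row dicts
combined = pd.DataFrame(rows)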
Upvotes: 6
Views: 11158
Reputation: 7522
Build a list of DataFrames inside the loop and call concat only once at the end. In my case this took concat from 10+ seconds to less than one second!

dfs = []
for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    df.sort_index(inplace=True)
    dfs.append(df)
combined = pd.concat(dfs)
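For a rough sense of the difference, a minimal timing sketch (make_df below is a made-up stand-in for generate_df; the absolute numbers depend on your frame sizes and machine):

import time
import numpy as np
import pandas as pd

def make_df():
    # stand-in for generate_df(): a small frame with fixed columns
    return pd.DataFrame(np.random.rand(10, 4), columns=list("abcd"))

# concat inside the loop: the growing combined frame is re-copied every iteration
start = time.perf_counter()
combined = pd.DataFrame()
for _ in range(1000):
    combined = pd.concat([combined, make_df()])
print("concat per iteration:", time.perf_counter() - start)

# collect first, concat once: each small frame is copied a single time
start = time.perf_counter()
combined = pd.concat([make_df() for _ in range(1000)])
print("concat once at the end:", time.perf_counter() - start)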
Upvotes: 2
Reputation: 594
The fastest way is to build a list of dictionaries and create the DataFrame only once at the end:
rows = []
for i in range(1, 1000):
    # Instead of generating a dataframe, generate a dictionary
    dictionary = generate_dictionary()
    rows.append(dictionary)
combined = pd.DataFrame(rows)
This is about 100 times faster than concatenating DataFrames, as the benchmark here shows.
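To reproduce that kind of comparison yourself, a minimal sketch (generate_dictionary and the row shape are made up here; the exact ratio depends on how many rows and columns you have):

import timeit
import pandas as pd

def generate_dictionary(i):
    # stand-in for the real row-producing code
    return {"a": i, "b": i * 2, "c": i * 3}

def build_from_dicts(n=1000):
    # accumulate plain dicts, build the DataFrame once
    rows = [generate_dictionary(i) for i in range(n)]
    return pd.DataFrame(rows)

def build_by_concat(n=1000):
    # concatenate one single-row DataFrame per iteration
    combined = pd.DataFrame()
    for i in range(n):
        combined = pd.concat([combined, pd.DataFrame([generate_dictionary(i)])])
    return combined

print("list of dicts: ", timeit.timeit(build_from_dicts, number=10))
print("concat per row:", timeit.timeit(build_by_concat, number=10))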
Upvotes: 12
Reputation: 863351
You can create a list of DataFrames and then use concat only once:
dfs = []
for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    dfs.append(df)
combined = pd.concat(dfs)
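If the loop body does nothing besides build each frame, the same idea can be written as a comprehension (a sketch; passing ignore_index=True is only an assumption about the index you want on the combined result):

combined = pd.concat([generate_df() for i in range(1, 1000)], ignore_index=True)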
Upvotes: 7