Jaafa
Jaafa

Reputation: 141

Speed up concatenation of excel files with Pandas

I'm trying to merge multiple excel files with pandas using the following code:

import pandas as pd
from os.path import basename


df = []
for f in ['test1.xlsx', 'test2.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    data.index = [basename(f)] * len(data)
    df.append(data)

df = pd.concat(df)

df.to_excel("merged_data2.xlsx")

Which works fine on the test files but when trying it on 3 other excel files of size over 100mb each the process becomes too slow to be useful. I saw this other post on the subject: Why does concatenation of DataFrames get exponentially slower?

And I believe I have correctly followed the advice of using a list before concatenating but without success. Any ideas? Thanks.

Upvotes: 0

Views: 1257

Answers (2)

Nelson
Nelson

Reputation: 339

In a data analysis project with Excel files larger than 50Mb, I had great difficulty using Pandas.

For this reason, I decided to create an extension (C++) that reads Excel spreadsheets directly into Pandas dataframes. With it, I achieved good performance gains in this process.

If you are interested, the project for this extension (dll/so) is available here.

Important notice: This project was entirely created by me and it is completely free. I'm not trying to self-promote, I'm just trying to help.

Upvotes: 0

waterproof
waterproof

Reputation: 5184

I don't see anything wrong with your pandas code; a 300mb excel file might just be a lot for Pandas to handle! Here are some approaches I'd take:

Tactic 1. Investigate

If I were you, my next step in debugging this would be to throw some print(datetime.now()) statements into the loop, to see whether it's the reading, the concatenating, or the .to_excel that's taking time. That way you may be able to narrow down the problem. Also take a look at your memory usage using appropriate tools for whatever OS you're in.

Tactic 2. Try a different tool

Pandas is optimized for scientific computing and it probably spends quite a bit of time organizing the data for querying and such. ETL isn't it's primary purpose. If you only need to concatenate a few sheets, (as much as it pains me to suggest doing something manually!) manual work in Excel itself will likely be the quickest way to go - highly-paid engineers at Microsoft have been tasked with optimizing that. If you need a programmatic approach, it could be worth trying out petl or one of the tools discussed here that may take a simpler/more efficient approach than pandas.

Some example petl code that might do the trick:

import petl
petl.cat(*
    petl.io.fromxlsx(file)
    for file in ['your.xlsx', 'excel.xlsx', 'files.xlsx']
).progress().toxlsx()

Upvotes: 0

Related Questions