How do I load and save only once with openpyxl? My program loads and saves the current excel file several times, which causes it to be very slow

Question

I have a program that imports raw data, interprets them, generates an excel file, and writes information to multiple sheets based on what was interpreted.

I inherited this program from someone else. They wrote the program using openpyxl, and they have several functions that load and save the excel file. These functions are called several times throughout the program. The larger the raw data file is, the more times the functions are called. This takes a lot of time for the program to run. A raw data file of 260kb takes over 2 hours for my laptop to process. I am struggling to make the excel file only save once, but that is what I hope to accomplish.

Here is how it was originally written in the createExcel.py file the previous developer wrote. All of these functions are called by main.py and other .py files several times.

(in createExcel.py)

def create_excel(file_name):
        # create the shell/format of the sheets in the excel file 
        # but does not write anything other than headers

        wb = Workbook()
        ...# lots of code
        wb.save(f'{file_name}.xlsx')


def insert_sheet_1_data(array):
        # take data that was interpreted in another .py file and add
        # the information to the relevant sheet in the excelfile

        wb = load_workbook(f"{fileName}.xlsx")
        ...
        wb.save(f'{file_name}.xlsx')

...

...

def insert_sheet_5_data(array):
        # take data that was interpreted in another .py file and add
        # the information to the relevant sheet in the excelfile

        wb = load_workbook(f"{fileName}.xlsx")
        ...
        wb.save(f'{file_name}.xlsx')

I tried declaring wb = Workbook outside of the functions in the createExcel.py. I then deleted all of the load_workbook calls and wb.save calls in the above functions. Then I imported createExcel into the main.py file, and wrote a line to save the wb file as the last line of code in main.py:

(in createExcel.py)

wb = Workbook()

def create_excel(file_name):
        # create the shell/format of the sheets in the excel file 
        # but does not write anything other than headers


def insert_sheet_1_data(array):
        # take data that was interpreted in another .py file and add
        # the information to the relevant sheet in the excelfile

...

def insert_sheet_5_data(array):
        # take data that was interpreted in another .py file and add
        # the information to the relevant sheet in the excelfile

(in main.py)

import createExcel
...

#rest of the code

wb = createExcel.wb
wb.save(f'{file_name}'
# end

The original code works, but it takes an incredibly long time to execute. The new code is much much faster, and through debug mode I see that it still goes through and execeutes all of the openpyxl related calls, but it never creates an excel file on my computer (that I can find), and it doesn't throw any errors about trying to save a file that doesn't exist, so I am not sure what to make of it.

Any insight as to what I am doing wrong here would be really appreciated! Thanks so much.

J_H · Accepted Answer

You mentioned that it takes ~ one minute per KiB to process a spreadsheet. That seems very very high. There must be some relevant details that the original post does not mention.

You wrote

# (in createExcel.py)

wb = Workbook()

That is not a Best Practice; please don't do that. More generally, avoid doing time consuming work at import time and especially avoid ".save()" side effects at that time. Better to defer such actions until run time, when a function has been explicitly called.

At the end of def create_excel(...), consider appending

    return wb

so you can communicate results through the call stack instead of through the filesystem.

You also wrote

# (in main.py)

import createExcel
...

# other things

wb = createExcel.wb
wb.save(f'{file_name}'

The import is fine, but grabbing a reference to createExcel.wb is not -- you don't want the import to spend time creating that object. Better that you assign wb = createExcel.create_excel(...). That is, defer doing the work until the caller actually needs that object.

Consider defining

def other_things(...):
    ...

so you can call it when appropriate, rather than at import time.

An overall theme to this code is that you wish to avoid doing useless repeated work. Consider structuring the code in this way:

class MyWorkbook:

    def __init__():
        self.wb = Workbook()  # or even: self.wb = self.create_excel(...)

Now all the various methods can cheaply access the existing self.wb attribute, perhaps adding sheets to it as they see fit. At the end you can .save() just once.

How do I load and save only once with openpyxl? My program loads and saves the current excel file several times, which causes it to be very slow

Answers (1)

Related Questions