Reputation: 13
I have a program that imports raw data, interprets them, generates an excel file, and writes information to multiple sheets based on what was interpreted.
I inherited this program from someone else. They wrote the program using openpyxl, and they have several functions that load and save the excel file. These functions are called several times throughout the program. The larger the raw data file is, the more times the functions are called. This takes a lot of time for the program to run. A raw data file of 260kb takes over 2 hours for my laptop to process. I am struggling to make the excel file only save once, but that is what I hope to accomplish.
Here is how it was originally written in the createExcel.py file the previous developer wrote. All of these functions are called by main.py and other .py files several times.
(in createExcel.py)
def create_excel(file_name):
# create the shell/format of the sheets in the excel file
# but does not write anything other than headers
wb = Workbook()
...# lots of code
wb.save(f'{file_name}.xlsx')
def insert_sheet_1_data(array):
# take data that was interpreted in another .py file and add
# the information to the relevant sheet in the excelfile
wb = load_workbook(f"{fileName}.xlsx")
...
wb.save(f'{file_name}.xlsx')
...
...
def insert_sheet_5_data(array):
# take data that was interpreted in another .py file and add
# the information to the relevant sheet in the excelfile
wb = load_workbook(f"{fileName}.xlsx")
...
wb.save(f'{file_name}.xlsx')
I tried declaring wb = Workbook outside of the functions in the createExcel.py. I then deleted all of the load_workbook calls and wb.save calls in the above functions. Then I imported createExcel into the main.py file, and wrote a line to save the wb file as the last line of code in main.py:
(in createExcel.py)
wb = Workbook()
def create_excel(file_name):
# create the shell/format of the sheets in the excel file
# but does not write anything other than headers
def insert_sheet_1_data(array):
# take data that was interpreted in another .py file and add
# the information to the relevant sheet in the excelfile
...
def insert_sheet_5_data(array):
# take data that was interpreted in another .py file and add
# the information to the relevant sheet in the excelfile
(in main.py)
import createExcel
...
#rest of the code
wb = createExcel.wb
wb.save(f'{file_name}'
# end
The original code works, but it takes an incredibly long time to execute. The new code is much much faster, and through debug mode I see that it still goes through and execeutes all of the openpyxl related calls, but it never creates an excel file on my computer (that I can find), and it doesn't throw any errors about trying to save a file that doesn't exist, so I am not sure what to make of it.
Any insight as to what I am doing wrong here would be really appreciated! Thanks so much.
Upvotes: 1
Views: 319
Reputation: 20460
You mentioned that it takes ~ one minute per KiB to process a spreadsheet. That seems very very high. There must be some relevant details that the original post does not mention.
You wrote
# (in createExcel.py)
wb = Workbook()
That is not a Best Practice; please don't do that.
More generally, avoid doing time consuming work at import
time
and especially avoid ".save()" side effects at that time.
Better to defer such actions until run time, when a function
has been explicitly called.
At the end of def create_excel(...)
, consider appending
return wb
so you can communicate results through the call stack instead of through the filesystem.
You also wrote
# (in main.py)
import createExcel
...
# other things
wb = createExcel.wb
wb.save(f'{file_name}'
The import
is fine, but grabbing a reference to createExcel.wb
is not -- you don't want the import
to spend time creating that object.
Better that you assign wb = createExcel.create_excel(...)
.
That is, defer doing the work until the caller actually needs that object.
Consider defining
def other_things(...):
...
so you can call it when appropriate, rather than at import
time.
An overall theme to this code is that you wish to avoid doing useless repeated work. Consider structuring the code in this way:
class MyWorkbook:
def __init__():
self.wb = Workbook() # or even: self.wb = self.create_excel(...)
Now all the various methods can cheaply access
the existing self.wb
attribute, perhaps
adding sheets to it as they see fit.
At the end you can .save()
just once.
Upvotes: 1