Reputation: 1316
I recently managed to create a program the reads data from excel, edit it and rewrite it along with new columns and it works good, but the issue is the performance if the excel file contains 1000 rows it finishes in less than 2 mins but if it contains 10-15k rows, it can take 3-4 hours and the more I have rows the more it becomes exponentially slower which doesnt make sense for me.
My code:
Reading from xls excel:
def xls_to_dict(workbook_url):
workbook_dict = {}
book = xlrd.open_workbook(workbook_url)
sheets = book.sheets()
for sheet in sheets:
workbook_dict[sheet.name] = {}
columns = sheet.row_values(0)
rows = []
for row_index in range(1, sheet.nrows):
row = sheet.row_values(row_index)
rows.append(row)
return rows
return workbook_dict
data = xls_to_dict(filename)
Writing in the excel:
rb = open_workbook(filename, formatting_info=True)
r_sheet = rb.sheet_by_index(0)
wb = copy(rb)
w_sheet = wb.get_sheet(0)
I read and found a package called Pandas that reads xlsx and tried working on it, but failed to access the data from the DataFrame to be a dictionary. So couldn't edit it and rewrite it to compare the performance.
My code:
fee = pd.read_excel(filename)
My input row data file is:
ID. NAME. FAMILY. DOB Country Description
My output file is: ID. NAME. FAMILY. DOB Country ModifiedDescription NATIONALITY
Any advice will be appreciated.
Upvotes: 0
Views: 462
Reputation: 4130
You can remove iterating over rows by converting sheet data to a dataframe and get values as list.
from openpyxl import load_workbook
from datetime import datetime,timedelta
from dateutil.relativedelta import relativedelta
def xls_to_dict(workbook_url):
xl = pd.ExcelFile(workbook_url)
workbook_dict = {}
for sheet in xl.sheet_names:
df = pd.read_excel(xl, sheet)
columns = df.columns
rows = df.values.tolist()
workbook_dict[sheet] = rows
return workbook_dict,columns
data,columns = xls_to_dict(filename)
for saving also you can remove for loop by using a dataframe
xl = pd.ExcelFile(filename)
sheet_name = xl.sheet_names[0] #sheet by index
df = pd.read_excel(xl, sheet_name)
df["DOB"] = pd.to_datetime(df["DOB"])
df["age"] = df["DOB"].apply(lambda x: abs(relativedelta(datetime.today(),x).years))
df["nationality"] = #logic to calculate nationality
book = load_workbook(filename)
writer = pd.ExcelWriter(filename, engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name)
writer.save()
Upvotes: 1