SMH
SMH

Reputation: 1316

Reading data from excel and rewriting it with a new column PYTHON

I recently managed to create a program the reads data from excel, edit it and rewrite it along with new columns and it works good, but the issue is the performance if the excel file contains 1000 rows it finishes in less than 2 mins but if it contains 10-15k rows, it can take 3-4 hours and the more I have rows the more it becomes exponentially slower which doesnt make sense for me.

My code:

Reading from xls excel:

 def xls_to_dict(workbook_url):
        workbook_dict = {}
        book = xlrd.open_workbook(workbook_url)
        sheets = book.sheets()
        for sheet in sheets:
            workbook_dict[sheet.name] = {}
            columns = sheet.row_values(0)
            rows = []
            for row_index in range(1, sheet.nrows):
                row = sheet.row_values(row_index)
                rows.append(row)
            return rows
        return workbook_dict

data = xls_to_dict(filename)

Writing in the excel:

rb = open_workbook(filename, formatting_info=True)
r_sheet = rb.sheet_by_index(0)
wb = copy(rb)
w_sheet = wb.get_sheet(0)

I read and found a package called Pandas that reads xlsx and tried working on it, but failed to access the data from the DataFrame to be a dictionary. So couldn't edit it and rewrite it to compare the performance.

My code:

fee = pd.read_excel(filename)

My input row data file is:

ID. NAME. FAMILY. DOB Country Description

My output file is: ID. NAME. FAMILY. DOB Country ModifiedDescription NATIONALITY

Any advice will be appreciated.

Upvotes: 0

Views: 462

Answers (1)

Rajith Thennakoon
Rajith Thennakoon

Reputation: 4130

You can remove iterating over rows by converting sheet data to a dataframe and get values as list.

from openpyxl import load_workbook
from datetime import datetime,timedelta
from dateutil.relativedelta import relativedelta

def xls_to_dict(workbook_url):
    xl = pd.ExcelFile(workbook_url)   
    workbook_dict = {}
    for sheet in xl.sheet_names:      
        df = pd.read_excel(xl, sheet) 
        columns = df.columns          
        rows = df.values.tolist()     
        workbook_dict[sheet] = rows
    return workbook_dict,columns

data,columns = xls_to_dict(filename)

for saving also you can remove for loop by using a dataframe

xl = pd.ExcelFile(filename) 
sheet_name = xl.sheet_names[0] #sheet by index
df = pd.read_excel(xl, sheet_name)
df["DOB"] = pd.to_datetime(df["DOB"])
df["age"] = df["DOB"].apply(lambda x: abs(relativedelta(datetime.today(),x).years))
df["nationality"] = #logic to calculate nationality 

book = load_workbook(filename)
writer = pd.ExcelWriter(filename, engine='openpyxl') 
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name)
writer.save()

Upvotes: 1

Related Questions