Speeding Up Excel Data to Pandas

Question

I have a really simple bit of code, where I have a group of file names and I need to open each one and extract some data to later manipulate.

for file in unique_file_names[1:]:
        file_name = rootdir + "/" + str(file)
        test_time = time.clock()    
        try:
            wb_loop = load_workbook(file_name, read_only=True, data_only=True)
            ws_loop = wb_loop["SHEET1"]
            df = pd.DataFrame(ws_loop.values)
            print("Opening Workbook:         ", time.clock()-test_time)

            newarray = np.vstack((newarray, df.loc[4:43,:13].values))
            print("Data Manipulation:         ", time.clock()-test_time)

So I've tried a few different modules to read in excel files, including directly using pandas.read_excel() and this is the optimum method, managing to get the time to open the workbook to 1.5-2s, and the numpy stacking takes 0.03 seconds ish.

I think allocating the data to a third dimension in the array based on an index would probably be quicker but I'm more focused on speeding up the time to load the spreadsheets, any suggestions?

Edit: I did also create a multithread pool to try and speed this up but for some reason it started using 15Gb ram and crashed my computer

Edit 2:

So the fastest way this was done was using xlrd as per the accepted answers recommendation. I also realised that it was quicker to delete the workbook at the end of the loop. The final code looks like

for file in unique_file_names[1:]:
        file_name = rootdir + "/" + str(file)
        test_time = time.clock()    
        try:
            wb_loop = xlrd.open_workbook(file_name, on_demand = True)
            ws_loop = wb_loop.sheet_by_name("Sheet1")
            print("Opening Workbook:         ", time.clock()-test_time)

            df = pd.DataFrame([ws_loop.row_values(n) for n in  range(ws_loop.nrows)])            

            newarray = np.vstack((newarray, df.loc[4:43,:13].values))
            del wb_loop

            print("Data Manipulation:         ", time.clock()-test_time)

        except:
            pass
        counter+=1
        print("%s %% Done" %(counter*100/len(unique_file_names)))

    wb_new = xlwt.Workbook()
    ws_new = wb_new.add_sheet("Test")
    ws_new.write(newarray)
    wb_new.save(r"C:Libraries/Documents/NewOutput.xls")

This outputs an average time per loop of 1.6-1.8s. Thanks for everyones help.

David · Accepted Answer

Here is a quick benchmark (extending this one). Apparently, using xlrd directly is slightly faster than pandas for the test .xlsx file. If .csv files are available, reading them is definitely much faster but converting them using LibreOffice is substantially slower:

pd_base 1.96 [in seconds]
pd_float 2.03
pd_object 2.01 [see cs95´s comment to your question]
pd_xlrd 1.95
pyxl_base 2.15
xlrd_base 1.79
csv_ready 0.17
csv_convert 18.72

Here is the code:

import pandas as pd
import openpyxl
import xlrd
import subprocess

file = 'test.xlsx'
df = pd.DataFrame([[i+j for i in range(50)] for j in range(100)])
df.to_excel(file, index=False)
df.to_csv(file.replace('.xlsx', '.csv'), index=False)

def pd_base():
    df = pd.read_excel(file)
def pd_float():
    df = pd.read_excel(file, dtype=np.int)
def pd_object():
    df = pd.read_excel(file, sheet_name="Sheet1", dtype=object)
def pd_xlrd():
    df = pd.read_excel(file, engine='xlrd')
def pyxl_base():
    wb = openpyxl.load_workbook(file, read_only=True, keep_links=False, data_only=True)
    sh = wb.active
    df = pd.DataFrame(sh.values)
def xlrd_base():
    wb = xlrd.open_workbook(file)
    sh = wb.sheet_by_index(0)
    df = pd.DataFrame([sh.row_values(n) for n in  range(sh.nrows)])
def csv_ready():    
    df = pd.read_csv(file.replace('.xlsx', '.csv'))
def csv_convert():    
    out = subprocess.check_output(['libreoffice --headless --convert-to csv test.xlsx'], shell=True, stderr=subprocess.STDOUT)
    df = pd.read_csv(file.replace('.xlsx', '.csv'))

def measure(func, nums=50):
    temp = time.time()
    for num in range(nums):
        func()
    diff = time.time() - temp
    print(func.__name__, '%.2f' % diff)

for func in [pd_base, pd_float, pd_object, pd_xlrd, pyxl_base, xlrd_base, csv_ready, csv_convert]:
    measure(func)

Speeding Up Excel Data to Pandas

Answers (2)

Related Questions