Reputation: 740
I have a really simple bit of code: I have a group of file names, and I need to open each one and extract some data to manipulate later.
import time
import numpy as np
import pandas as pd
from openpyxl import load_workbook

for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.clock()
    try:
        wb_loop = load_workbook(file_name, read_only=True, data_only=True)
        ws_loop = wb_loop["SHEET1"]
        df = pd.DataFrame(ws_loop.values)
        print("Opening Workbook: ", time.clock() - test_time)
        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        print("Data Manipulation: ", time.clock() - test_time)
    except Exception:
        pass
So I've tried a few different modules to read in Excel files, including pandas.read_excel() directly, and this is the best method I've found, getting the time to open a workbook down to 1.5-2 s; the numpy stacking takes about 0.03 s.
I think allocating the data into a third dimension of the array based on an index would probably be quicker, but I'm more focused on speeding up the time to load the spreadsheets. Any suggestions?
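By "allocating into a third dimension" I mean something like this rough sketch (the 40 x 14 block shape just comes from the df.loc[4:43, :13] slice above):

import numpy as np
import pandas as pd
from openpyxl import load_workbook

# Pre-allocate one 3-D block and fill it by index instead of growing
# newarray with np.vstack on every iteration.
files = unique_file_names[1:]
all_data = np.empty((len(files), 40, 14))

for i, file in enumerate(files):
    wb = load_workbook(rootdir + "/" + str(file), read_only=True, data_only=True)
    df = pd.DataFrame(wb["SHEET1"].values)
    all_data[i] = df.loc[4:43, :13].values

# the 2-D layout used so far is then a single reshape
newarray = all_data.reshape(-1, 14)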
Edit: I also created a multithread pool to try to speed this up, but for some reason it started using 15 GB of RAM and crashed my computer.
Edit 2:
So the fastest way to do this was using xlrd, as per the accepted answer's recommendation. I also realised that it was quicker to delete the workbook at the end of the loop. The final code looks like:
import time
import numpy as np
import pandas as pd
import xlrd
import xlwt

counter = 0
newarray = np.empty((0, 14))  # 14 columns to match the df.loc[4:43, :13] slice below
for file in unique_file_names[1:]:
    file_name = rootdir + "/" + str(file)
    test_time = time.clock()
    try:
        wb_loop = xlrd.open_workbook(file_name, on_demand=True)
        ws_loop = wb_loop.sheet_by_name("Sheet1")
        print("Opening Workbook: ", time.clock() - test_time)
        df = pd.DataFrame([ws_loop.row_values(n) for n in range(ws_loop.nrows)])
        newarray = np.vstack((newarray, df.loc[4:43, :13].values))
        del wb_loop
        print("Data Manipulation: ", time.clock() - test_time)
    except:
        pass
    counter += 1
    print("%s %% Done" % (counter * 100 / len(unique_file_names)))

wb_new = xlwt.Workbook()
ws_new = wb_new.add_sheet("Test")
# xlwt writes one cell at a time, so iterate over the array
for r, row in enumerate(newarray):
    for c, value in enumerate(row):
        ws_new.write(r, c, value)
wb_new.save(r"C:/Libraries/Documents/NewOutput.xls")
This outputs an average time per loop of 1.6-1.8 s. Thanks for everyone's help.
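An alternative to del that should free the same resources is xlrd's release_resources(), which is intended for workbooks opened with on_demand=True; roughly:

wb_loop = xlrd.open_workbook(file_name, on_demand=True)
try:
    ws_loop = wb_loop.sheet_by_name("Sheet1")
    df = pd.DataFrame([ws_loop.row_values(n) for n in range(ws_loop.nrows)])
finally:
    # closes the underlying file and releases the sheets xlrd has loaded
    wb_loop.release_resources()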
Upvotes: 11
Views: 933
Reputation: 1979
Here is a quick benchmark (extending this one). Apparently, using xlrd directly is slightly faster than pandas for the test .xlsx file. If .csv files are already available, reading them is much faster, but converting .xlsx files to .csv with LibreOffice is substantially slower:
pd_base 1.96 [in seconds]
pd_float 2.03
pd_object 2.01 [see cs95's comment to your question]
pd_xlrd 1.95
pyxl_base 2.15
xlrd_base 1.79
csv_ready 0.17
csv_convert 18.72
Here is the code:
import subprocess
import time

import openpyxl
import pandas as pd
import xlrd

file = 'test.xlsx'

# build a 100 x 50 test workbook and a matching .csv copy
df = pd.DataFrame([[i + j for i in range(50)] for j in range(100)])
df.to_excel(file, index=False)
df.to_csv(file.replace('.xlsx', '.csv'), index=False)

def pd_base():
    df = pd.read_excel(file)

def pd_float():
    df = pd.read_excel(file, dtype=float)

def pd_object():
    df = pd.read_excel(file, sheet_name="Sheet1", dtype=object)

def pd_xlrd():
    df = pd.read_excel(file, engine='xlrd')

def pyxl_base():
    wb = openpyxl.load_workbook(file, read_only=True, keep_links=False, data_only=True)
    sh = wb.active
    df = pd.DataFrame(sh.values)

def xlrd_base():
    wb = xlrd.open_workbook(file)
    sh = wb.sheet_by_index(0)
    df = pd.DataFrame([sh.row_values(n) for n in range(sh.nrows)])

def csv_ready():
    df = pd.read_csv(file.replace('.xlsx', '.csv'))

def csv_convert():
    out = subprocess.check_output(
        'libreoffice --headless --convert-to csv test.xlsx',
        shell=True, stderr=subprocess.STDOUT)
    df = pd.read_csv(file.replace('.xlsx', '.csv'))

def measure(func, nums=50):
    temp = time.time()
    for num in range(nums):
        func()
    diff = time.time() - temp
    print(func.__name__, '%.2f' % diff)

for func in [pd_base, pd_float, pd_object, pd_xlrd, pyxl_base, xlrd_base,
             csv_ready, csv_convert]:
    measure(func)
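If the same workbooks get read more than once, one way to get the csv_ready speed without paying the csv_convert cost on every run is to convert each .xlsx to .csv once and cache it. A rough sketch (read_cached and the csv_cache directory are made-up names; --outdir tells LibreOffice where to put the converted file, and only the first sheet of a workbook ends up in the CSV):

import os
import subprocess
import pandas as pd

def read_cached(xlsx_path, cache_dir='csv_cache'):
    # convert the workbook to CSV only if no up-to-date cached copy exists
    os.makedirs(cache_dir, exist_ok=True)
    csv_path = os.path.join(cache_dir,
                            os.path.basename(xlsx_path).replace('.xlsx', '.csv'))
    if (not os.path.exists(csv_path)
            or os.path.getmtime(csv_path) < os.path.getmtime(xlsx_path)):
        subprocess.check_output(
            ['libreoffice', '--headless', '--convert-to', 'csv',
             '--outdir', cache_dir, xlsx_path],
            stderr=subprocess.STDOUT)
    return pd.read_csv(csv_path)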
Upvotes: 2
Reputation: 83
Two tips:

- If you don't provide a dtype, most libraries will default to the widest one (64 bits). If your data still won't fit in memory even after controlling for types, you need to think about partitioning and spilling to disk (see the memmap sketch after the example below).
- Use the concurrent.futures Executor interface with a bounded pool of worker processes rather than an unbounded thread pool, so memory use stays under control.

Below is an example of your code rewritten to control for data types and to use the Executor interface:
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd
from openpyxl import load_workbook

# rootdir and unique_file_names are assumed to be defined as in the question

def load_single(file):
    file_name = rootdir + "/" + str(file)
    wb_loop = load_workbook(file_name, read_only=True, data_only=True)
    ws_loop = wb_loop["SHEET1"]
    df = pd.DataFrame(ws_loop.values)
    # float32 halves the memory footprint compared with the default float64
    partial_array = df.loc[4:43, :13].values.astype(np.float32)
    return partial_array

def run():
    executor = ProcessPoolExecutor(max_workers=4)
    files = unique_file_names[1:]
    results = executor.map(load_single, files)
    # 14 columns to match the df.loc[4:43, :13] slice above
    new_array = np.empty((0, 14), dtype=np.float32)
    for partial_array in results:
        new_array = np.vstack([new_array, partial_array])

if __name__ == "__main__":  # required for ProcessPoolExecutor on Windows
    run()
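For the "partitioning and spilling to disk" case from the first tip, here is a minimal sketch using np.memmap; the file name all_blocks.dat and the 40 x 14 block shape are assumptions based on the slice above, and load_single / unique_file_names are the names from the example:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

n_files = len(unique_file_names[1:])

# Disk-backed array: each block is written straight to the file instead of
# accumulating in RAM, and can be sliced later like a normal ndarray.
big = np.memmap('all_blocks.dat', dtype=np.float32, mode='w+',
                shape=(n_files, 40, 14))

with ProcessPoolExecutor(max_workers=4) as executor:
    for i, block in enumerate(executor.map(load_single, unique_file_names[1:])):
        big[i] = block

big.flush()  # make sure everything is written out to disk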
Upvotes: 0