Reputation: 33

Python: load excel header without loading remaining data

I am working with very big Excel files, which take a long time to load with Pandas in Python. Before processing the data, the user has to select quite a few options related to the data, and these only require the names of the columns in each dataset. It is very inconvenient for the user to wait, sometimes for minutes, until the data is loaded just to select the necessary options, and then wait another few minutes while the program does the actual processing.

So, my question is: is there a way to load only the header from an Excel file with Python? I think of it as a counterpart to the "skiprows" parameter of the Pandas read_excel function: instead of skipping rows at the beginning of the data, I would like to skip rows at the end. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with csv files, but unfortunately that doesn't help me here.

Thank you for the help!

Upvotes: 1

Answers (3)

kakaji

Reputation: 191

from dask import dataframe as dd

# note: read_csv handles CSV files, not .xlsx
df = dd.read_csv("filename")

Trust me, it's fast. I am reading an 800 MB file this way.
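
If the data is available as CSV and only the column names are needed, parsing the first line is enough. A minimal sketch with the standard library, where read_csv_header is a made-up helper name and UTF-8 encoding is an assumption:

```python
import csv


def read_csv_header(path, encoding="utf-8"):
    """Return the first row of a CSV file as a list of column names."""
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="", encoding=encoding) as f:
        # only the first record is parsed; an empty file gives []
        return next(csv.reader(f), [])
```

The rest of the file is never parsed, so the call should return almost immediately regardless of file size.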

Upvotes: 0

K_Raikar

Reputation: 126

You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large excel file (around 75,000 rows) and the timing results:

from datetime import datetime
import pandas as pd
import sxl


# 1) load the whole file with pandas
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format(datetime.now() - startTime))


# 2) ask pandas for only the first 5 rows
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows=5)
print("Time taken with top 5 rows with pandas read excel is {}".format(datetime.now() - startTime))


# 3) read the first 5 rows with sxl
startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]  # look up the sheet by number
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format(datetime.now() - startTime))

Pandas read_excel loads the whole data into memory even when nrows is set, so there is not much of a difference in timing between the first two runs. Here are the outputs from the above:

Time taken to load whole data with pandas read excel is 0:00:49.174538
Time taken with top 5 rows with pandas read excel is 0:00:44.478523
Time taken to load top 5 rows using sxl is 0:00:00.671717
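
If installing another package is not an option, the same idea (stream the file and stop after the first row) can be sketched with only the standard library, since an .xlsx file is a zip archive of XML parts. This is a rough sketch and not a description of how sxl works internally. read_xlsx_header is a made-up helper name, and it assumes the first worksheet is stored as xl/worksheets/sheet1.xml, which is common but not guaranteed. Empty header cells may be missing from the XML, so positions can shift, and numeric values come back as strings.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the worksheet and shared-string parts of an .xlsx file
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_xlsx_header(path, sheet_member="xl/worksheets/sheet1.xml"):
    """Return the values of the first stored row of one worksheet as a list.

    sheet_member is an assumption about where the sheet lives in the archive.
    """
    with zipfile.ZipFile(path) as zf:
        cells = []  # (cell type, raw value) pairs for the first stored row
        with zf.open(sheet_member) as f:
            # iterparse streams the XML, so we can stop at the first </row>
            for _, el in ET.iterparse(f):
                if el.tag != NS + "row":
                    continue
                for c in el.iter(NS + "c"):
                    ctype = c.get("t")
                    if ctype == "inlineStr":
                        raw = "".join(t.text or "" for t in c.iter(NS + "t"))
                    else:
                        raw = c.findtext(NS + "v")
                    cells.append((ctype, raw))
                break

        # Cells of type "s" hold an index into the shared-string table
        wanted = {int(raw) for ctype, raw in cells
                  if ctype == "s" and raw is not None}
        shared = {}
        if wanted:
            last = max(wanted)
            with zf.open("xl/sharedStrings.xml") as f:
                index = 0
                for _, el in ET.iterparse(f):
                    if el.tag != NS + "si":
                        continue
                    if index in wanted:
                        shared[index] = "".join(
                            t.text or "" for t in el.iter(NS + "t"))
                    if index >= last:
                        break  # no need to read the rest of the table
                    index += 1
                    el.clear()

    return [shared.get(int(raw)) if ctype == "s" and raw is not None else raw
            for ctype, raw in cells]
```

Calling read_xlsx_header with the workbook's path returns the column names as a list, which is enough to let the user pick options before the full load starts.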

I hope this helps!!

Upvotes: 4

Viknesh S K

Reputation: 121

You can use the 'skipfooter' parameter or the 'nrows' parameter with both .xlsx and .csv files. However, the two cannot be used together.

import pandas as pd

path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter=99999)

This means 99999 rows are skipped, counting up from the bottom of the sheet, and only the remaining rows at the top are loaded.

path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)

This means only the first 5 rows are loaded, along with the header.

Also refer to this Stack Overflow question.

Upvotes: 0
