Reputation: 33
I am working with very big Excel files, which take a long time to be loaded with Pandas in Python. Before processing the data, the user has to select quite a few options related to the data, which only require the names of the each column in each dataset. It is very inconvenient for the user to have to wait sometimes minutes until the data is loaded to be able to select the necessary options and then let the program do the actual processing for another few minutes.
So, my question is: is there a way to load only the data header from an Excel file with Python? In a way I think of it as an alternate version to the "skiprows" parameter in the read_excel Pandas function, where instead of skipping rows in the beginning of the data, I would like to skip rows at the end of the data. I want to emphasize that my goal is to reduce the time Python takes to load the files. I also know there are ways to do this with csv files, but unfortunately it didn't help me.
Thank you for the help!
Upvotes: 1
Views: 2042
Reputation: 191
from dask import dataframe as dd
df= dd.read_csv(“filename”)
Trust me its fast I am reading 800 mb of file
Upvotes: 0
Reputation: 126
You can try to use the sxl module (https://pypi.org/project/sxl/). Here is the code I tried for a large excel file (around 75,000 rows) and the timing results:
from datetime import datetime
startTime = datetime.now()
import pandas as pd
import sxl
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx')
print("Time taken to load whole data with pandas read excel is {}".format((datetime.now() - startTime)))
startTime = datetime.now()
df = pd.read_excel('\\Big_Excel.xlsx', nrows = 5)
print("Time taken with top 5 rows with pandas read excel is {}".format((datetime.now() - startTime)))
startTime = datetime.now()
wb = sxl.Workbook('\\Big_Excel.xlsx')
ws = wb.sheets[1]
data = ws.head(5)
print("Time taken to load top 5 rows using sxl is {}".format((datetime.now() - startTime)))
Pandas read excel loads the whole data in memory, so there is not much of a difference difference in timing. Here are the outputs from the above:
I hope this helps!!
Upvotes: 4
Reputation: 121
You can use 'skipfooter' parameter or 'nrows' parameter in both .xlsx & .csv. However, both cannot be used together.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, skipfooter = 99999)
which means, 99999 rows will be skipped from footer to top & remaining records from header will load.
path = r'c:\users\abc\def\stack.xlsx'
df = pd.read_excel(path, nrows= 5)
which means, first 5 rows will be shown with header.
Also refer this Stack over flow Question.
Upvotes: 0