Reputation: 96274
The new version of Pandas uses the following interface to load Excel files:
read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
but what if I don't know the sheets that are available?
For example, I am working with excel files that the following sheets
Data 1, Data 2 ..., Data N, foo, bar
but I don't know N
a priori.
Is there any way to get the list of sheets from an excel document in Pandas?
Upvotes: 308
Views: 429960
Reputation: 1893
You should explicitly specify the second parameter (sheet_name) as None. like this:
df = pandas.read_excel("/yourPath/FileName.xlsx", sheet_name=None);
"df" are all sheets as a dictionary of DataFrames, you can verify it by run this:
df.keys()
result like this:
[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']
please refer pandas doc for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Upvotes: 100
Reputation: 663
If using pd.read_excel()
with sheet_name=None
to read sheet names, setting nrows=0
can significantly boost speeds.
Benchmarking on an 8MB XLSX file with nine sheets:
%timeit pd.read_excel('foo.xlsx', sheet_name=None, nrows=0).keys()
>>> 145 ms ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_excel('foo.xlsx', sheet_name=None).keys()
>>> 16.5 s ± 2.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
This takes advantage of the fact that (as you might expect) pandas
stops reading a sheet once it reaches nrows
(reference).
Upvotes: 3
Reputation: 39
import pandas as pd
path = "\\DB\\Expense\\reconcile\\"
file_name = "202209-v01.xlsx"
df = pd.read_excel(path + file_name, None)
print(df)
sheet_names = list(df.keys())
# print last sheet name
print(sheet_names[len(sheet_names)-1])
last_month = df.get(sheet_names[len(sheet_names)-1])
print(last_month)
Upvotes: 2
Reputation: 341
The easiest way to retrieve the sheet-names from an excel (xls., xlsx) is:
tabs = pd.ExcelFile("path").sheet_names
print(tabs)
Then to read and store the data of a particular sheet (say, sheet names are "Sheet1", "Sheet2", etc.), say "Sheet2" for example:
data = pd.read_excel("path", "Sheet2")
print(data)
Upvotes: 24
Reputation: 71
#It will work for Both '.xls' and '.xlsx' by using pandas
import pandas as pd
excel_Sheet_names = (pd.ExcelFile(excelFilePath)).sheet_names
#for '.xlsx' use only openpyxl
from openpyxl import load_workbook
excel_Sheet_names = (load_workbook(excelFilePath, read_only=True)).sheet_names
Upvotes: 6
Reputation: 135
if you read excel file
dfs = pd.ExcelFile('file')
then use
dfs.sheet_names
dfs.parse('sheetname')
another variant
df = pd.read_excel('file', sheet_name='sheetname')
Upvotes: 3
Reputation: 11
With the load_workbook readonly option, what was earlier seen as a execution seen visibly waiting for many seconds happened with milliseconds. The solution could however be still improved.
import pandas as pd
from openpyxl import load_workbook
class ExcelFile:
def __init__(self, **kwargs):
........
.....
self._SheetNames = list(load_workbook(self._name,read_only=True,keep_links=False).sheetnames)
The Excelfile.parse takes the same time as reading the complete xls in order of 10s of sec. This result was obtained with windows 10 operating system with below package versions
C:\>python -V
Python 3.9.1
C:\>pip list
Package Version
--------------- -------
et-xmlfile 1.0.1
numpy 1.20.2
openpyxl 3.0.7
pandas 1.2.3
pip 21.0.1
python-dateutil 2.8.1
pytz 2021.1
pyxlsb 1.0.8
setuptools 49.2.1
six 1.15.0
xlrd 2.0.1
Upvotes: 1
Reputation: 9996
If you:
Below was benchmarked on a ~10Mb xlsx
, xlsb
file.
xlsx, xls
from openpyxl import load_workbook
def get_sheetnames_xlsx(filepath):
wb = load_workbook(filepath, read_only=True, keep_links=False)
return wb.sheetnames
Benchmarks: ~ 14x speed improvement
# get_sheetnames_xlsx vs pd.read_excel
225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xlsb
from pyxlsb import open_workbook
def get_sheetnames_xlsb(filepath):
with open_workbook(filepath) as wb:
return wb.sheets
Benchmarks: ~ 56x speed improvement
# get_sheetnames_xlsb vs pd.read_excel
96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Notes:
xlrd
is no longer maintained as of 2020Upvotes: 4
Reputation: 2471
from openpyxl import load_workbook
sheets = load_workbook(excel_file, read_only=True).sheetnames
For a 5MB Excel file I'm working with, load_workbook
without the read_only
flag took 8.24s. With the read_only
flag it only took 39.6 ms. If you still want to use an Excel library and not drop to an xml solution, that's much faster than the methods that parse the whole file.
Upvotes: 3
Reputation: 151
This is the fastest way I have found, inspired by @divingTobi's answer. All The answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.
from zipfile import ZipFile
from bs4 import BeautifulSoup # you also need to install "lxml" for the XML parser
with ZipFile(file) as zipped_file:
summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]
Upvotes: 15
Reputation: 2300
Building on @dhwanil_shah 's answer, you do not need to extract the whole file. With zf.open
it is possible to read from a zipped file directly.
import xml.etree.ElementTree as ET
import zipfile
def xlsxSheets(f):
zf = zipfile.ZipFile(f)
f = zf.open(r'xl/workbook.xml')
l = f.readline()
l = f.readline()
root = ET.fromstring(l)
sheets=[]
for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
sheets.append(c.attrib['name'])
return sheets
The two consecutive readline
s are ugly, but the content is only in the second line of the text. No need to parse the whole file.
This solution seems to be much faster than the read_excel
version, and most likely also faster than the full extract version.
Upvotes: 3
Reputation: 468
I have tried xlrd, pandas, openpyxl and other such libraries and all of them seem to take exponential time as the file size increase as it reads the entire file. The other solutions mentioned above where they used 'on_demand' did not work for me. If you just want to get the sheet names initially, the following function works for xlsx files.
def get_sheet_details(file_path):
sheets = []
file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
# Make a temporary directory with the file name
directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
os.mkdir(directory_to_extract_to)
# Extract the xlsx file as it is just a zip file
zip_ref = zipfile.ZipFile(file_path, 'r')
zip_ref.extractall(directory_to_extract_to)
zip_ref.close()
# Open the workbook.xml which is very light and only has meta data, get sheets from it
path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
with open(path_to_workbook, 'r') as f:
xml = f.read()
dictionary = xmltodict.parse(xml)
for sheet in dictionary['workbook']['sheets']['sheet']:
sheet_details = {
'id': sheet['@sheetId'],
'name': sheet['@name']
}
sheets.append(sheet_details)
# Delete the extracted files directory
shutil.rmtree(directory_to_extract_to)
return sheets
Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.
Benchmarking: (On a 6mb xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds
Since my requirement was just reading the sheet names, the unnecessary overhead of reading the entire time was bugging me so I took this route instead.
Upvotes: 3
Reputation: 375485
You can still use the ExcelFile class (and the sheet_names
attribute):
xl = pd.ExcelFile('foo.xls')
xl.sheet_names # see all sheet names
xl.parse(sheet_name) # read a specific sheet to DataFrame
see docs for parse for more options...
Upvotes: 523