Reputation: 765
I'm currently using pandas to read an Excel file and present its sheet names to the user, so he can select which sheet he would like to use. The problem is that the files are really big (70 columns x 65k rows), taking up to 14s to load on a notebook (the same data in a CSV file is taking 3s).
My pandas code goes like this:
import pandas
xls = pandas.ExcelFile(path)
sheets = xls.sheet_names
I tried xlrd before, but obtained similar results. This was my code with xlrd:
import xlrd
xls = xlrd.open_workbook(path)
sheets = xls.sheet_names
So, can anybody suggest a faster way to retrieve the sheet names from an Excel file than reading the whole file?
Upvotes: 72
Views: 185245
Reputation: 166
Try this:
import pandas as pd
df = pd.read_excel(file, sheet_name=None, nrows=0)
print(df.keys())
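Since sheet_name=None returns a dict of DataFrames keyed by sheet name, you can also take the names as a plain list:
sheet_names = list(df)  # iterating the dict yields its keys, i.e. the sheet names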
Upvotes: 3
Reputation: 23
A naive solution for handling all of xlsx, xlsm, and xlsb can be further derived from Cedric's answer:
from __future__ import annotations

import html
import re
import zipfile
from pathlib import Path

def _get_xlsx_names(file_path: Path) -> list[str]:
    with zipfile.ZipFile(file_path, "r") as zip_ref:
        xml = zip_ref.read("xl/workbook.xml").decode("utf-8")
    return [
        html.unescape(sheet_name)
        for sheet_name in re.findall(r'<sheet.*?name="([^"]+?)".*?/>', xml)
    ]

def _get_xlsm_names(file_path: Path) -> list[str]:
    with zipfile.ZipFile(file_path, "r") as zip_ref:
        xml = zip_ref.read("docProps/app.xml").decode("utf-8")
    return [
        html.unescape(sheet_name)
        # Find all titles
        for titles in re.findall(r"<TitlesOfParts>(.+?)</TitlesOfParts>", xml)
        # Find all sheet names in titles
        for sheet_name in re.findall(r"<vt:lpstr>(.+?)</vt:lpstr>", titles)
    ]

def get_sheet_names(file_path: str | Path) -> list[str]:
    file_path = Path(file_path)
    if file_path.suffix == ".xlsx":
        return _get_xlsx_names(file_path)
    if file_path.suffix in [".xlsm", ".xlsb"]:
        return _get_xlsm_names(file_path)
    msg = f"Unsupported file extension '{file_path.suffix}' found."
    raise NotImplementedError(msg)
Regarding the usage of " (and <, >, &) in sheet names, from my limited experimentation:
xlsx files replace " with &quot;
xlsx, xlsm, and xlsb replace < with &lt;, > with &gt;, and & with &amp;
html.unescape can solve all of these issues for us, and we're still only using the standard library.
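For illustration, here is how a hypothetical escaped sheet name (as it would appear inside the workbook XML) unescapes:
import html
print(html.unescape("P&amp;L &quot;Q1&quot; &lt;draft&gt;"))  # prints: P&L "Q1" <draft>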
Tweaking our regexes to use capture groups can help simplify our Python syntax.
While this works on my machine for my limited testing, all this use of regex to quickly parse xml does put me on edge.
Upvotes: 0
Reputation: 11
XLSB & XLSM solution. Inspired by Cedric Bonjour.
import re
import zipfile

def get_sheet_names(file_path):
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.read("docProps/app.xml").decode("utf-8")
    xml = re.findall("<TitlesOfParts>.*</TitlesOfParts>", xml)[0]
    sheets = re.findall(">([^>]*)<", xml)
    sheets = list(filter(None, sheets))
    return sheets
Upvotes: 1
Reputation: 13
A simple way to read Excel sheet names:
import openpyxl
wb = openpyxl.load_workbook(r'<path-to-filename>')
print(wb.sheetnames)
To read data from a specific sheet in Excel using pandas:
import pandas as pd
df = pd.read_excel(io = '<path-to-file>', engine='openpyxl', sheet_name = 'Report', header=7, skipfooter=1).drop_duplicates()
Upvotes: 1
Reputation: 331
Using standard libraries only:
import re
from pathlib import Path
import xml.etree.ElementTree as ET
from zipfile import Path as ZipPath
def sheet_names(path: Path) -> tuple[str, ...]:
    xml: bytes = ZipPath(path, at="xl/workbook.xml").read_bytes()
    root: ET.Element = ET.fromstring(xml)
    namespace = m.group(0) if (m := re.match(r"\{.*\}", root.tag)) else ""
    return tuple(x.attrib["name"] for x in root.findall(f"./{namespace}sheets/") if x.tag == f"{namespace}sheet")
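For example (assuming a workbook named Book1.xlsx in the working directory):
print(sheet_names(Path("Book1.xlsx")))
# e.g. ('Sheet1', 'my_sheet 2')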
Upvotes: 2
Reputation: 51
Building on dhwanil-shah's answer, I find this to be the most efficient:
import re
import zipfile

def get_excel_sheet_names(file_path):
    sheets = []
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.read("xl/workbook.xml").decode("utf-8")
    for s_tag in re.findall("<sheet [^>]*", xml):
        sheets.append(re.search('name="[^"]*', s_tag).group(0)[6:])
    return sheets
sheets = get_excel_sheet_names("Book1.xlsx")
print(sheets)
# prints: "['Sheet1', 'my_sheet 2']"
A working xlsb alternative:
import re
import zipfile

def get_xlsb_sheet_names(file_path):
    sheets = []
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.read("docProps/app.xml").decode("utf-8")
    # Narrow down to the titles block first
    xml = re.findall("<TitlesOfParts>.*</TitlesOfParts>", xml)[0]
    # Non-greedy matches so each <vt:lpstr> element is captured separately
    for s_tag in re.findall("<vt:lpstr>.*?</vt:lpstr>", xml):
        sheets.append(re.search('>.*?<', s_tag).group(0)[1:-1])
    return sheets
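Usage mirrors the xlsx version above (assuming a Book1.xlsb file):
sheets = get_xlsb_sheet_names("Book1.xlsb")
print(sheets)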
Upvotes: 5
Reputation: 9996
From my research with the standard/popular libs, as of 2020 this hasn't been implemented for xlsx/xls, but you can do it for xlsb. Either way, these solutions should give you vast performance improvements for xls, xlsx, and xlsb.
Below was benchmarked on a ~10 MB xlsx/xlsb file.
xlsx, xls
from openpyxl import load_workbook
def get_sheetnames_xlsx(filepath):
wb = load_workbook(filepath, read_only=True, keep_links=False)
return wb.sheetnames
Benchmark: ~ 14x speed improvement
# get_sheetnames_xlsx vs pd.read_excel
225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xlsb
from pyxlsb import open_workbook
def get_sheetnames_xlsb(filepath):
with open_workbook(filepath) as wb:
return wb.sheets
Benchmark: ~ 56x speed improvement
# get_sheetnames_xlsb vs pd.read_excel
96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Notes:
xlrd is no longer maintained as of 2020
Upvotes: 24
Reputation: 515
Python code adaptation with a full pathlib Path passed as the filename (e.g., Path(r'c:\xml\file.xlsx')). From Dhwanil Shah's answer, but without the Django method used to create a temp dir.
import xmltodict
import shutil
import zipfile

def get_sheet_details(filename):
    sheets = []
    # Make a temporary directory with the file name
    directory_to_extract_to = filename.with_suffix('')
    directory_to_extract_to.mkdir(parents=True, exist_ok=True)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(filename, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
    dictionary = xmltodict.parse(xml)
    for sheet in dictionary['workbook']['sheets']['sheet']:
        sheet_details = {
            'id': sheet['@sheetId'],  # can be sheetId for some versions
            'name': sheet['@name']    # can be name
        }
        sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets
Upvotes: 1
Reputation: 93
By combining @Dhwanil shah's answer with the answer here I wrote code that is also compatible with xlsx files that have only one sheet:
import zipfile

import xmltodict

def get_sheet_ids(file_path):
    sheet_names = []
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.open(r'xl/workbook.xml').read()
        dictionary = xmltodict.parse(xml)
        # A single-sheet workbook parses to a dict rather than a list
        if not isinstance(dictionary['workbook']['sheets']['sheet'], list):
            sheet_names.append(dictionary['workbook']['sheets']['sheet']['@name'])
        else:
            for sheet in dictionary['workbook']['sheets']['sheet']:
                sheet_names.append(sheet['@name'])
    return sheet_names
Upvotes: 3
Reputation: 1
You can also use:
import pandas as pd
data = pd.read_excel('demanddata.xlsx', sheet_name='oil&gas')
print(data)
Here demanddata.xlsx is the name of your file and oil&gas is one of its sheet names. Your workbook may contain any number of sheets; just pass the name of the sheet you want to fetch as sheet_name="Name of your required sheet".
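For reference, sheet_name also accepts a zero-based index or a list of names (standard pandas behavior):
first_sheet = pd.read_excel('demanddata.xlsx', sheet_name=0)        # first sheet by position
subset = pd.read_excel('demanddata.xlsx', sheet_name=['oil&gas'])   # dict of DataFrames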
Upvotes: -4
Reputation: 468
I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take longer and longer as the file size increases, because they read the entire file. The other solutions mentioned here that used 'on_demand' did not work for me. The following function works for xlsx files.
import os
import shutil
import zipfile

import xmltodict

# `settings` is this Django project's settings module (source of MEDIA_ROOT)
from django.conf import settings

def get_sheet_details(file_path):
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)
    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()
    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
    dictionary = xmltodict.parse(xml)
    for sheet in dictionary['workbook']['sheets']['sheet']:
        sheet_details = {
            'id': sheet['sheetId'],  # can be @sheetId for some versions
            'name': sheet['name']    # can be @name
        }
        sheets.append(sheet_details)
    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets
Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.
Benchmarking: (on a 6 MB xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds
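The same idea also works fully in memory, without extracting to disk. A minimal sketch combining this approach with zipfile's in-archive reads (assuming xmltodict is installed):
import zipfile

import xmltodict

def get_sheet_names_in_memory(file_path):
    # Read workbook.xml straight out of the archive; nothing is written to disk
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        xml = zip_ref.read('xl/workbook.xml')
    sheets = xmltodict.parse(xml)['workbook']['sheets']['sheet']
    # A single-sheet workbook parses to a dict rather than a list
    if not isinstance(sheets, list):
        sheets = [sheets]
    return [sheet['@name'] for sheet in sheets]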
Upvotes: 18
Reputation: 1427
You can use the xlrd library and open the workbook with the "on_demand=True" flag, so that the sheets won't be loaded automatically.
Then you can retrieve the sheet names in a similar way to pandas:
import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
print(xls.sheet_names())  # <- remember: xlrd sheet_names is a function, not a property
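With on_demand=True, only the sheet the user eventually picks needs to be loaded, e.g. (assuming a sheet named 'Sheet1'):
sheet = xls.sheet_by_name('Sheet1')  # loads only this sheet into memory
xls.unload_sheet('Sheet1')           # frees it again when done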
Upvotes: 88