Reputation: 1270
What are the Python3 options to efficiently (performance and memory) extract sheet names and for a given sheet, and also column names from a very large .xlsx file?
I've tried using pandas:
For sheet names using pd.ExcelFile
:
xl = pd.ExcelFile(filename)
return xl.sheet_names
For column names using pd.ExcelFile
:
xl = pd.ExcelFile(filename)
df = xl.parse(sheetname, nrows=2, **kwargs)
df.columns
For column names using pd.read_excel
with and without nrows
(>v23):
df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
df.columns
However, both pd.ExcelFile
and and pd.read_excel
seem to read the entire .xlsx in memory and are therefore slow.
Thanks a lot!
Upvotes: 4
Views: 9263
Reputation: 175
I think this would help the need
from openpyxl import load_workbook
workbook = load_workbook(filename, read_only=True)
data = {} #for storing the value of sheet with their respective columns
for sheet in worksheets:
for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
data[sheet.title] = value #value would be a tuple with headings of each column
Upvotes: 1
Reputation: 1
This program lists all the sheets in the excel. Pandas is used here.
import pandas as pd
with pd.ExcelFile('yourfile.xlsx') as xlsx :
sh=xlsx.sheet_names
print("This workbook has the following sheets : ",sh)
Upvotes: 0
Reputation: 781
Here is the easiest way I can share with you:
# read the sheet file
import pandas as pd
my_sheets = pd.ExcelFile('sheet_filename.xlsx')
my_sheets.sheet_names
Upvotes: 4
Reputation: 2072
According to this SO question, reading excel files in chunks is not supported (see this issue on github), and using nrows
will always read all the file into memory first.
Possible solutions:
Upvotes: 1