Reputation: 625
I know how to read an Excel table with pandas:
import pandas as pd
table = pd.read_excel(io)
After loading the data, I can get the table header with:
table.columns
This works, but sometimes I just want to get the header of the Excel table directly. When the Excel file has a large body, loading the whole table into memory is very time-consuming and unnecessary; sometimes it even runs out of memory and hangs. Looking at the official documentation, it seems I can use the nrows
parameter to read only a specific number of rows, which means I should be able to read just the first-row header:
header = pd.read_excel(io, nrows = 0)
However, I found that this still does not prevent pandas from reading the whole Excel file, so it still consumes a lot of time and memory. Does anyone have good experience dealing with this problem?
Upvotes: 5
Views: 11230
Reputation: 21
import pandas as pd
Frame = pd.read_excel("/content/data.xlsx", header=0)
print(Frame.head(0))
This will give you the header only, assuming the header is on row 1. If no value is passed to head()
, the default of 5
rows is assumed; that is why you would otherwise obtain multiple lines.
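A quick sketch of that behaviour, using a small in-memory DataFrame as a stand-in for the Excel file (head(0) acts the same either way):

```python
import pandas as pd

# Stand-in for pd.read_excel("/content/data.xlsx", header=0)
df = pd.DataFrame({"id": [1, 2], "name": ["apple", "banana"]})

header_only = df.head(0)          # empty DataFrame that keeps the columns
print(list(header_only.columns))  # ['id', 'name']
```

Note that this trims the rows only after pandas has parsed them; it does not reduce the read cost.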
Upvotes: 2
Reputation: 21947
This function sheet_rows
uses openpyxl
directly, not pandas
; it's much faster than read_excel(nrows=0)
, and simple:
#!/usr/bin/env python3
import sys
import openpyxl  # https://openpyxl.readthedocs.io
import pandas as pd
#...............................................................................

def sheet_rows( sheet, nrows=3, ncols=None, verbose=5 ) -> "list of lists":
    """ openpyxl sheet -> the first `nrows` rows x `ncols` columns
        verbose=5: print A1 .. A5, E1 .. E5 as lists
    """
    rows = sheet.iter_rows( max_row=nrows, max_col=ncols, values_only=True )
    rows = [list(r) for r in rows]  # generator -> list of lists
    if verbose:
        print( "\n-- %s  %d rows  %d cols" % (
                sheet.title, sheet.max_row, sheet.max_column ))
        for row in rows[:verbose]:
            trimNone = list( filter( None, row[:verbose] ))
            print( trimNone )
    return rows

nrows = 3
xlsxin = sys.argv[1]
wb = openpyxl.load_workbook( xlsxin, read_only=True )
print( "\n-- openpyxl.load_workbook( \"%s\" )" % xlsxin )
for sheetname in wb.sheetnames:
    sheet = wb[sheetname]
    rows = sheet_rows( sheet, nrows=nrows )
    df = (pd.DataFrame( rows )  # index=  columns=
            .dropna( axis="index", how="all" )
            .dropna( axis="columns", how="all" )
         )
    print( df )
    # df.to_excel  df.to_csv ...
"Partial read" under pyexcel
explains that most Excel readers
read ALL the data into memory before doing anything else -- slow.
openpyxl iter_rows()
gets a few rows or columns fast, memory don't know.
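An even smaller sketch of the same idea, reading only the header row in read-only mode (the file name sample.xlsx and its contents are made up here; the script builds its own sample file so it is runnable as-is):

```python
import openpyxl

# Build a small sample workbook, standing in for your real, large file.
wb = openpyxl.Workbook()
ws = wb.active
ws.append(["id", "name", "price"])   # header row
ws.append([1, "apple", 0.50])        # body rows we do NOT want to load
ws.append([2, "banana", 0.25])
wb.save("sample.xlsx")

# read_only mode streams the worksheet, so only the rows we actually
# iterate over get parsed -- here, just the first one.
wb = openpyxl.load_workbook("sample.xlsx", read_only=True)
sheet = wb.active
header = list(next(sheet.iter_rows(max_row=1, values_only=True)))
wb.close()
print(header)  # ['id', 'name', 'price']
```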
Upvotes: 2
Reputation: 66
import pandas as pd
Frame = pd.read_excel("/content/data.xlsx", header=0)
Frame.head()
Upvotes: 0