Xu Zhoufeng
Xu Zhoufeng

Reputation: 625

how to use Pandas to only read excel header?

I know read excel table with pandas:

import pandas as pd

table = pd.read_excel(io)

After loading the data, if I want to get the table header:

table.columns

This method is feasible, but sometimes I just want to get the header of the excel table directly, especially when the excel table has a large body size, it will be very time-consuming to load the data table into the memory & it is also unnecessary, sometimes it even overflows directly and gets stuck. Looking at the official documents, it seems that I can use the nrows parameter to specify that only specific lines of Excel can be read, This means that I can use it to read only the first row header:

header = pd.read_excel(io, nrows = 0)

However, I found that also can not prevent pandas read the whole excel data, and it will still consume a lot of time and memory. Do you have good experience in dealing with this problem?

Upvotes: 5

Views: 11230

Answers (3)

Aziz A. Rasul
Aziz A. Rasul

Reputation: 21

import pandas as pd 
Frame = pd.read_excel("/content/data.xlsx", header=0)
print(Frame.head(0))

This will give you the header only, assuming the header is on row 1. If no value is entered, the default value of 5 is assumed. Hence that is why you obtain multiple lines.

Upvotes: 2

denis
denis

Reputation: 21947

This function sheet_rows uses openpyxl directly, not pandas; it's much faster than read_excel( nrows=0 ), and simple:

#!/usr/bin/env python3

import openpyxl  # https://openpyxl.readthedocs.io

#...............................................................................
def sheet_rows( sheet, nrows=3, ncols=None, verbose=5 ) -> "list of lists":
    """ openpyxl sheet -> the first `nrows` rows x `ncols` columns
        verbose=5: print A1 .. A5, E1 .. E5 as lists
    """
    rows = sheet.iter_rows( max_row=nrows, max_col=ncols, values_only=True )
    rows = [list(r) for r in rows]  # generator -> list of lists
    if verbose:
        print( "\n-- %s  %d rows  %d cols" % (
                sheet.title, sheet.max_row, sheet.max_column ))
        for row in rows[:verbose]:
            trimNone = list( filter( None, row[:verbose] ))
            print( trimNone )
    return rows


# xlsxin = sys.argv[1]
wb = openpyxl.load_workbook( xlsxin, read_only=True )
print( "\n-- openpyxl.load_workbook( \"%s\" )" % xlsxin )

for sheetname in wb.sheetnames:
    sheet = wb[sheetname]

    rows = sheet_rows( sheet, nrows=nrows )

    df = (pd.DataFrame( rows )  # index= columns=
            .dropna( axis="index", how="all" )
            .dropna( axis="columns", how="all" ) 
            )
    print( df )
    # df.to_excel df.to_csv ...

"Partial read" under pyexcel explains that most Excel readers read ALL the data into memory before doing anything else -- slow. openpyxl iter_rows() gets a few rows or columns fast, memory don't know.

Upvotes: 2

Mitali Patel
Mitali Patel

Reputation: 66

import pandas as pd 

Frame=pd.read_excel("/content/data.xlsx" , header=0)
Frame.head()

Upvotes: 0

Related Questions