Reputation: 21572

Reading an Excel named range into a pandas DataFrame

How can you read the data from a named range in Excel into a pandas DataFrame?

Unfortunately, the canonical function pandas.read_excel() is designed to only read entire sheets within a workbook.

Upvotes: 13

Answers (6)

Matthias Fripp

Reputation: 18625

Maybe someday pandas will support this natively. Until then, I use a helper function:

import pandas as pd
import openpyxl

def data_frame_from_xlsx(xlsx_file, range_name):
    """ Get a single rectangular region from the specified file.
    range_name can be a standard Excel reference ('Sheet1!A2:B7') or 
    refer to a named region ('my_cells')."""
    wb = openpyxl.load_workbook(xlsx_file, data_only=True, read_only=True)
    if '!' in range_name:
        # passed a worksheet!cell reference
        ws_name, reg = range_name.split('!')
        if ws_name.startswith("'") and ws_name.endswith("'"):
            # optionally strip single quotes around sheet name
            ws_name = ws_name[1:-1]
        region = wb[ws_name][reg]
    else:
        # passed a named range; find the cells in the workbook
        full_range = wb.get_named_range(range_name)
        if full_range is None:
            raise ValueError(
                'Range "{}" not found in workbook "{}".'.format(range_name, xlsx_file)
            )
        # convert to list (openpyxl 2.3 returns a list but 2.4+ returns a generator)
        destinations = list(full_range.destinations) 
        if len(destinations) > 1:
            raise ValueError(
                'Range "{}" in workbook "{}" contains more than one region.'
                .format(range_name, xlsx_file)
            )
        ws, reg = destinations[0]
        # convert to worksheet object (openpyxl 2.3 returns a worksheet object 
        # but 2.4+ returns the name of a worksheet)
        if isinstance(ws, str):
            ws = wb[ws]
        region = ws[reg]
    # an anonymous user suggested this to catch a single-cell range (untested):
    # if not isinstance(region, 'tuple'): df = pd.DataFrame(region.value)
    df = pd.DataFrame([cell.value for cell in row] for row in region)
    return df

Upvotes: 7

Giorgio Balestrieri

Reputation: 1062

Well, it's been a while, but I would definitely recommend giving a shot to xlwings.

Upvotes: 0

user5852223

Reputation:

Here is the way I use openpyxl to copy a range in a [[]] :

wb = load_workbook(filename=xlPath)
ws, range= next(wb.defined_names["rangename"].destinations)
materials = [[cell.value for cell in row] for row in wb[ws][range]]

Upvotes: 3

Racing Tadpole

Reputation: 4360

You can use the underlying xlrd package to do this.

The xlrd package comes with an examples directory which contains xlrdnameAPIdemo.py, as documented here.

In a nutshell for the named range print_area try:

book = xlrd.open_workbook('examples/namesdemo.xls')
name_obj = book.name_map['print_area'][0]
print name_obj.__dict__

You'll see name_obj has an entry:

'result': Operand(kind=oREF, value=[Ref3D(coords=(2, 3, 0, 4, 0, 14))], text=u'Sheet3!$A$1:$N$4')

which you can follow the example to interpret, though it doesn't look straightforward - eg. the range may be relative or not, depending on the value result.kind.

Further, when I tried to use this to read my own spreadsheet (created on a Mac), I found result was None; instead, the only ref to the range in name_obj was:

'formula_text': u'Sheet1!$B$6:$E$11'

So there may be a way to make this work in a general case, but it looks like it would take some trial and error.

As an alternative, if you can format your spreadsheet so that instead of named ranges, your table follows in the rows immediately after a unique heading (key), and finishes with a blank row, here is a function which finds the right parameters to send to pd.read_excel:

def table_position(path, sheet_name, key):
    """
    Find the start and end rows of a table in an Excel spreadsheet
    based on the first occurence of key text on the sheet, and down
    to the first blank line.

    Returns (col, start_row, end_row, skip_footer)

    where: 
        col is the column number containing the key text,
        start_row is the row after this, 
        end_row is the row number of the next blank line,
        skip_footer is how many rows from the end of the sheet this is.

    You can then read in the table with:
        x = pd.read_excel(path, sheet_name, skiprows=start, skip_footer=skip_footer, header=0)
        x = x.dropna(axis=1, how='all')
    """
    import xlrd
    book = xlrd.open_workbook(path)
    sheet = book.sheet_by_name(sheet_name)
    # find the first occurrence of the key, and the next line break
    (col, start, end) = (-1, -1, sheet.nrows)
    for rownum in xrange(sheet.nrows):
        if col<0: # look for key to start the table off
            try:
                test_col = next(c for c in xrange(sheet.ncols) if sheet.cell(rownum, c).value==key)
            except StopIteration:
                pass
            else:
                col, start = test_col, rownum+1 # row after key text is the start
        else: # test for blank line as end of table
            if not [True for cell in sheet.row(rownum) if cell.value]:
                end = rownum
                break
    skip_footer = sheet.nrows - end
    return (col, start, end, skip_footer)

If you do follow this with a pd.read_excel then you are reading the data file twice, which is silly, but you get the idea.

Upvotes: 1

g3o2

Reputation: 281

To quote the Microsoft Office help pages!:

A [named range] is a meaningful shorthand that makes it easier to understand the purpose of a cell reference, constant, formula, or table, each of which may be difficult to comprehend at first glance."

Named ranges are furthermore frequently used in spreadsheets to easier access data through ODBC and are particularly useful when there are several data ranges within the same worksheet. To connect via ODBC to Excel, simply choose the appropriate Excel driver and send an SQL statement such as e.g.:

SELECT * 
FROM namedRange

The useful command in Pandas would probably be read_sql.

In Windows, this solution requires however that you align/streamline the installed software versions (32-bit or 64-bit) of Excel, the ODBC driver and the software package from which you open the ODBC connection. As an example, an installed Excel 32-bit version will require a 32-bit ODBC driver and normally a 32-bit installation of Python. Note: this latter point remains to be confirmed for the Python case (I'm a beginner to Python), but I can definitely confirm this point for ODBC connections launched from SAS, SPSS or Stata.

The previous requirement is a very significant drawback and actually speaks in favor of any solution which does not involve ODBC at all. That said, it would be nice if read_Excel provided such a facility. In this context, it is interesting to note that SAS, SPSS and Stata currently do not allow direct access to named ranges in their respective Excel filters - so maybe there is an objective reason for this lacking feature...

Upvotes: 3

Andy Hayden

Reputation: 375535

You can do this in a round about way using read_excel, it offers:

skiprows : list-like
    Rows to skip at the beginning (0-indexed)

skip_footer : int, default 0
    Rows at the end to skip (0-indexed)

parse_cols : int or list, default None
        If None then parse all columns,
        If int then indicates last column to be parsed
        If list of ints then indicates list of column numbers to be parsed
        If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)

This means if you know the column names and the row numbers (presumably what you mean by "named range"?) you can select just that section to make the DataFrame.

Upvotes: -1

Reading an Excel named range into a pandas DataFrame

Answers (6)

Related Questions