Reputation: 21572
How can you read the data from a named range in Excel into a pandas DataFrame?
Unfortunately, the canonical function pandas.read_excel()
is designed to only read entire sheets within a workbook.
Upvotes: 13
Views: 20441
Reputation: 18625
Maybe someday pandas will support this natively. Until then, I use a helper function:
import pandas as pd
import openpyxl
def data_frame_from_xlsx(xlsx_file, range_name):
""" Get a single rectangular region from the specified file.
range_name can be a standard Excel reference ('Sheet1!A2:B7') or
refer to a named region ('my_cells')."""
wb = openpyxl.load_workbook(xlsx_file, data_only=True, read_only=True)
if '!' in range_name:
# passed a worksheet!cell reference
ws_name, reg = range_name.split('!')
if ws_name.startswith("'") and ws_name.endswith("'"):
# optionally strip single quotes around sheet name
ws_name = ws_name[1:-1]
region = wb[ws_name][reg]
else:
# passed a named range; find the cells in the workbook
full_range = wb.get_named_range(range_name)
if full_range is None:
raise ValueError(
'Range "{}" not found in workbook "{}".'.format(range_name, xlsx_file)
)
# convert to list (openpyxl 2.3 returns a list but 2.4+ returns a generator)
destinations = list(full_range.destinations)
if len(destinations) > 1:
raise ValueError(
'Range "{}" in workbook "{}" contains more than one region.'
.format(range_name, xlsx_file)
)
ws, reg = destinations[0]
# convert to worksheet object (openpyxl 2.3 returns a worksheet object
# but 2.4+ returns the name of a worksheet)
if isinstance(ws, str):
ws = wb[ws]
region = ws[reg]
# an anonymous user suggested this to catch a single-cell range (untested):
# if not isinstance(region, 'tuple'): df = pd.DataFrame(region.value)
df = pd.DataFrame([cell.value for cell in row] for row in region)
return df
Upvotes: 7
Reputation: 1062
Well, it's been a while, but I would definitely recommend giving a shot to xlwings.
See also Xlwings take value from defined names .
Upvotes: 0
Reputation:
Here is the way I use openpyxl to copy a range in a [[]] :
wb = load_workbook(filename=xlPath)
ws, range= next(wb.defined_names["rangename"].destinations)
materials = [[cell.value for cell in row] for row in wb[ws][range]]
Upvotes: 3
Reputation: 4360
You can use the underlying xlrd
package to do this.
The xlrd
package comes with an examples
directory which contains xlrdnameAPIdemo.py
, as documented here.
In a nutshell for the named range print_area
try:
book = xlrd.open_workbook('examples/namesdemo.xls')
name_obj = book.name_map['print_area'][0]
print name_obj.__dict__
You'll see name_obj
has an entry:
'result': Operand(kind=oREF, value=[Ref3D(coords=(2, 3, 0, 4, 0, 14))], text=u'Sheet3!$A$1:$N$4')
which you can follow the example to interpret, though it doesn't look straightforward - eg. the range may be relative or not, depending on the value result.kind
.
Further, when I tried to use this to read my own spreadsheet (created on a Mac), I found result
was None
; instead, the only ref to the range in name_obj
was:
'formula_text': u'Sheet1!$B$6:$E$11'
So there may be a way to make this work in a general case, but it looks like it would take some trial and error.
As an alternative, if you can format your spreadsheet so that instead of named ranges, your table follows in the rows immediately after a unique heading (key
), and finishes with a blank row, here is a function which finds the right parameters to send to pd.read_excel
:
def table_position(path, sheet_name, key):
"""
Find the start and end rows of a table in an Excel spreadsheet
based on the first occurence of key text on the sheet, and down
to the first blank line.
Returns (col, start_row, end_row, skip_footer)
where:
col is the column number containing the key text,
start_row is the row after this,
end_row is the row number of the next blank line,
skip_footer is how many rows from the end of the sheet this is.
You can then read in the table with:
x = pd.read_excel(path, sheet_name, skiprows=start, skip_footer=skip_footer, header=0)
x = x.dropna(axis=1, how='all')
"""
import xlrd
book = xlrd.open_workbook(path)
sheet = book.sheet_by_name(sheet_name)
# find the first occurrence of the key, and the next line break
(col, start, end) = (-1, -1, sheet.nrows)
for rownum in xrange(sheet.nrows):
if col<0: # look for key to start the table off
try:
test_col = next(c for c in xrange(sheet.ncols) if sheet.cell(rownum, c).value==key)
except StopIteration:
pass
else:
col, start = test_col, rownum+1 # row after key text is the start
else: # test for blank line as end of table
if not [True for cell in sheet.row(rownum) if cell.value]:
end = rownum
break
skip_footer = sheet.nrows - end
return (col, start, end, skip_footer)
If you do follow this with a pd.read_excel
then you are reading the data file twice, which is silly, but you get the idea.
Upvotes: 1
Reputation: 281
To quote the Microsoft Office help pages!:
A [named range] is a meaningful shorthand that makes it easier to understand the purpose of a cell reference, constant, formula, or table, each of which may be difficult to comprehend at first glance."
Named ranges are furthermore frequently used in spreadsheets to easier access data through ODBC and are particularly useful when there are several data ranges within the same worksheet. To connect via ODBC to Excel, simply choose the appropriate Excel driver and send an SQL statement such as e.g.:
SELECT *
FROM namedRange
The useful command in Pandas would probably be read_sql.
In Windows, this solution requires however that you align/streamline the installed software versions (32-bit or 64-bit) of Excel, the ODBC driver and the software package from which you open the ODBC connection. As an example, an installed Excel 32-bit version will require a 32-bit ODBC driver and normally a 32-bit installation of Python. Note: this latter point remains to be confirmed for the Python case (I'm a beginner to Python), but I can definitely confirm this point for ODBC connections launched from SAS, SPSS or Stata.
The previous requirement is a very significant drawback and actually speaks in favor of any solution which does not involve ODBC at all. That said, it would be nice if read_Excel provided such a facility. In this context, it is interesting to note that SAS, SPSS and Stata currently do not allow direct access to named ranges in their respective Excel filters - so maybe there is an objective reason for this lacking feature...
Upvotes: 3
Reputation: 375535
You can do this in a round about way using read_excel, it offers:
skiprows : list-like
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
parse_cols : int or list, default None
If None then parse all columns,
If int then indicates last column to be parsed
If list of ints then indicates list of column numbers to be parsed
If string then indicates comma separated list of column names and column ranges (e.g. “A:E” or “A,C,E:F”)
This means if you know the column names and the row numbers (presumably what you mean by "named range"?) you can select just that section to make the DataFrame.
Upvotes: -1