Reputation: 63
First time posting here, I apologize if this question has been asked before - I can't find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I've had an Excel Auto_Open macro that downloads several Excel files and double clicks on the "Grand Total" row in order to extract all of the data, which ultimate gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default - the pivotcache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a Pandas Data Frame and either save that output as a CSV or load it directly into our database. It seems that this is not capable with OpenPyXl and that I'll probably need to use win32com.client.
Does anybody have any experience with this, and know if it's even possible? Any pointers for where I might get started? I've tried several items from the Excel Object model (PivotCache, GetData, etc etc) but either I don't know how to use them or they don't return what I need.
Any help would be much appreciated. Thanks!
Upvotes: 6
Views: 2749
Reputation: 2300
Building on @PMHM excellent answer I have modified the code to take care of source data with blank cells. The piece of code that needed modification is the following:
for field in pivot_table.cache.cacheFields:
if field.sharedItems.count > 0:
# take care of cases where f.v returns an AttributeError because the cell is empty
# fields_map[field.name] = [f.v for f in field.sharedItems._fields]
l = []
for f in field.sharedItems._fields:
try:
l += [f.v]
except AttributeError:
l += [""]
fields_map[field.name] = l
The complete code (mostly copy/paste from above) is therefore:
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing
file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']
# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'
# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]
# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
if field.sharedItems.count > 0:
# take care of cases where f.v returns an AttributeError because the cell is empty
# fields_map[field.name] = [f.v for f in field.sharedItems._fields]
l = []
for f in field.sharedItems._fields:
try:
l += [f.v]
except AttributeError:
l += [""]
fields_map[field.name] = l
# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
# If some field in the record in missing, we replace it by NaN
record_values = [
field.v if not isinstance(field, Missing) else np.nan for field in record._fields
]
row_dict = {k: v for k, v in zip(column_names, record_values)}
# Shared fields are mapped as an Index, so we replace the field index by its value
for key in fields_map:
row_dict[key] = fields_map[key][row_dict[key]]
rows.append(row_dict)
df = pd.DataFrame.from_dict(rows)
Upvotes: 4
Reputation: 193
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl
are:
openpyxl
to get the openpyxl.pivot.table.TableDefinition
object from the desired pivot table (let's call it my_pivot_table
)my_pivot_table.cache.cacheFields
dict
in two sub-steps:
my_pivot_table.cache.records.r
. Cache fields in these records are stored as indexes from my_pivot_table.cache.cacheFields
cache.records.r
and cache.cacheFields
dict
with rows into a pandas
DataFrameBelow you will find a copy of the code that implements such solution. Since the structure of these Excel objects are somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I'm adding further below minimal examples of the main objects being manipulated, so people can get a better sense of what is going on, what are the objects being returned, etc.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing
file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']
# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'
# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]
# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
if field.sharedItems.count > 0:
fields_map[field.name] = [f.v for f in field.sharedItems._fields]
# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
# If some field in the record in missing, we replace it by NaN
record_values = [
field.v if not isinstance(field, Missing) else np.nan for field in record._fields
]
row_dict = {k: v for k, v in zip(column_names, record_values)}
# Shared fields are mapped as an Index, so we replace the field index by its value
for key in fields_map:
row_dict[key] = fields_map[key][row_dict[key]]
rows.append(row_dict)
df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2)
FUEL YEAR REGION STATE UNIT Jan Feb (...)
0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)
1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition
. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)
fields_map
(from cache.cacheFields
)This is a dict
with column name and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
2018.0, 2019.0, 2020.0],
'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
'REGION': ['S', 'NE', 'N', (...)]}
row_dict
(before mapping)Each row is a dict
with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields
(see above)
{'YEAR': 0, # <<<--- 0 stands for index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 0, # <<<--- index in fields_map
'Dec': 10818.094,
'STATE': 0, # <<<--- index in fields_map
(...)
'UNIT': 'm3'}
row_dict
(after mapping)After extracting raw values for cache fields from their indexes, we have a dict
that represent all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 'GASOLINE (m3)', # extracted from fields_map
'Dec': 10818.094,
'STATE': 'TEXAS', # extracted from fields_map
(...)
'UNIT': 'm3'}
Upvotes: 7