Reputation: 6709
I have Excel files with multiple sheets, each of which looks a little like this (but much longer):
       Sample   CD4   CD8
Day 1    8311  17.3  6.44
         8312  13.6  3.50
         8321  19.8  5.88
         8322  13.5  4.09
Day 2    8311  16.0  4.92
         8312  5.67  2.28
         8321  13.0  4.34
         8322  10.6  1.95
The first column is actually four cells merged vertically.
When I read this using pandas.read_excel, I get a DataFrame that looks like this:
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95
How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)
Upvotes: 66
Views: 83167
Reputation: 141
To read an Excel file so that every cell covered by a merged range is filled in (in other words, all the DataFrame values belonging to one merged range are the same), I used the following code. It was largely inspired by @ztr. Thank you, ztr.
from openpyxl import load_workbook
import pandas as pd


def _convert_cell_ref_to_df_ref(cell_ref: tuple) -> tuple:
    # openpyxl yields 1-based (row, column) pairs; DataFrame.iloc is 0-based.
    row_offset = 1
    col_offset = 1
    return (cell_ref[0] - row_offset, cell_ref[1] - col_offset)


file_path = '/file/path.xlsx'  # Will not work for `.xls`
sheet_name = 'sheet name'

excel = pd.ExcelFile(file_path)
df = excel.parse(
    sheet_name=sheet_name,
    header=None,  # If you want to keep the default headers, then remove this argument and change `row_offset` from 1 to 2.
)

openpyxl_wb = load_workbook(file_path)
for merged_cell in openpyxl_wb[sheet_name].merged_cells:
    try:
        # The first cell of the range (top-left) holds the merged value.
        merge_val = df.iloc[_convert_cell_ref_to_df_ref(next(iter(merged_cell.cells)))]
        for cell in merged_cell.cells:
            df.iloc[_convert_cell_ref_to_df_ref(cell)] = merge_val
    except IndexError as e:
        print("Most likely the last row in this Excel file is a blank merged cell, which is often trimmed when read by Pandas.")
        print(e)
See my other answer for how to read sheet names from an Excel file.
Upvotes: 1
Reputation: 1
You can use openpyxl. Note that the Excel sheet contains a header row and an index column, while the DataFrame's positional data does not, so each index has to be shifted by 1. In addition, openpyxl uses 1-based indices while iloc uses 0-based ones, so the total offset is 2. This snippet may not be efficient, because I have only used it on sheets of about 20x20. You can improve it on your own.
# %%
from openpyxl import load_workbook
import pandas as pd
file_name = "file.xlsx"
df = pd.read_excel(file_name, index_col=0, header=0)
wb = load_workbook(file_name)
# read_excel reads the first sheet by default, so look up merged ranges on that same sheet.
# (wb.get_sheet_by_name is deprecated; index the workbook by sheet name instead.)
sheet = wb[wb.sheetnames[0]]
ms_set = sheet.merged_cells
# %%
for ms in ms_set:
    # 1-based
    # (start col, start row, end col [included], end row [included])
    b = ms.bounds
    # This method is not efficient, especially since your file is large, but you may find a parallelized way to do this or some Python syntactic sugar to speed it up.
    # Note: this assumes the merged range lies in the data area, not in the header row or index column.
    df.iloc[b[1]-2:b[3]-1, b[0]-2:b[2]-1] = df.iloc[b[1]-2, b[0]-2]
# %%
df
# %%
Upvotes: 0
Reputation: 601
To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.
df = pd.read_excel('path_to_file.xlsx', index_col=[0])
Passing index_col as a list makes pandas treat those columns as a MultiIndex and forward-fill the blanks that merged cells leave behind. With a list of length one, pandas still fills in the data but creates a regular Index.
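To see the difference, here is a small round-trip sketch. It assumes openpyxl is installed as the Excel engine; the workbook contents, the file name merged_demo.xlsx, and the variable names are made up for illustration and only mimic the question's sample.

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook

# Build a tiny workbook whose first column has vertically merged cells
# (hypothetical data modeled on the question's sample).
wb = Workbook()
ws = wb.active
ws.append([None, "Sample", "CD4", "CD8"])
ws.append(["Day 1", 8311, 17.3, 6.44])
ws.append([None, 8312, 13.6, 3.50])
ws.append(["Day 2", 8311, 16.0, 4.92])
ws.append([None, 8312, 5.67, 2.28])
ws.merge_cells("A2:A3")
ws.merge_cells("A4:A5")

path = os.path.join(tempfile.mkdtemp(), "merged_demo.xlsx")
wb.save(path)

# Default read: the merged column comes back with NaN below each label.
plain = pd.read_excel(path)

# index_col as a one-element list: the labels are filled down into a regular Index.
filled = pd.read_excel(path, index_col=[0])

print(plain)
print(filled)
```

In this sketch, plain keeps the gaps in its first column, while filled carries each day label on every row of its index.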
Upvotes: 21
Reputation: 151
df = df.fillna(method='ffill', axis=0) # forward-fill down each column to fill in the missing row entries
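Two caveats may be worth a sketch: recent pandas versions deprecate the method= argument of fillna in favor of the dedicated ffill() method, and filling the whole frame also fills genuine gaps in the measurement columns. The example below assumes the merged column was read in as an ordinary column; the column name "Day" is hypothetical, since the question's first column has no header.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking what read_excel returns for the question's sheet,
# with the merged column read as a regular column named "Day" (assumed name).
df = pd.DataFrame(
    {
        "Day": ["Day 1", np.nan, np.nan, np.nan, "Day 2", np.nan, np.nan, np.nan],
        "Sample": [8311, 8312, 8321, 8322] * 2,
        "CD4": [17.3, 13.6, 19.8, 13.5, 16.0, 5.67, 13.0, 10.6],
        "CD8": [6.44, 3.50, 5.88, 4.09, 4.92, 2.28, 4.34, 1.95],
    }
)

# Forward-fill only the merged column, leaving the data columns untouched.
df["Day"] = df["Day"].ffill()

# With the labels filled in, grouping by day works as usual.
means = df.groupby("Day")[["CD4", "CD8"]].mean()
print(means)
```

Restricting the fill to one column is a design choice: it avoids silently inventing values if CD4 or CD8 ever contain real missing measurements.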
Upvotes: 15
Reputation: 881027
You could use the Series.fillna method to forward-fill the NaN values:
df.index = pd.Series(df.index).fillna(method='ffill')
For example,
In [42]: df
Out[42]:
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
NaN      8312  13.60  3.50
NaN      8321  19.80  5.88
NaN      8322  13.50  4.09
Day 2    8311  16.00  4.92
NaN      8312   5.67  2.28
NaN      8321  13.00  4.34
NaN      8322  10.60  1.95
[8 rows x 3 columns]
In [43]: df.index = pd.Series(df.index).fillna(method='ffill')
In [44]: df
Out[44]:
       Sample    CD4   CD8
Day 1    8311  17.30  6.44
Day 1    8312  13.60  3.50
Day 1    8321  19.80  5.88
Day 1    8322  13.50  4.09
Day 2    8311  16.00  4.92
Day 2    8312   5.67  2.28
Day 2    8321  13.00  4.34
Day 2    8322  10.60  1.95
[8 rows x 3 columns]
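For readers on recent pandas versions, where fillna(method=...) is deprecated, the same index fill can be sketched with Series.ffill(). The frame below is rebuilt by hand from the sample above rather than read from Excel, so the construction step is an assumption made only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame with the NaN-riddled index that read_excel produces.
df = pd.DataFrame(
    {
        "Sample": [8311, 8312, 8321, 8322] * 2,
        "CD4": [17.3, 13.6, 19.8, 13.5, 16.0, 5.67, 13.0, 10.6],
        "CD8": [6.44, 3.50, 5.88, 4.09, 4.92, 2.28, 4.34, 1.95],
    },
    index=["Day 1", np.nan, np.nan, np.nan, "Day 2", np.nan, np.nan, np.nan],
)

# Same idea as above, using ffill() instead of fillna(method='ffill').
df.index = pd.Series(df.index).ffill()

# Once the index is filled, label-based selection and grouping both work.
day2 = df.loc["Day 2"]
cd4_max = df.groupby(level=0)["CD4"].max()
print(day2)
print(cd4_max)
```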
Upvotes: 87