Radulescu Petru

Reputation: 127

Pandas merged cell issue when reading from excel

I have an .xls file with an unusual layout that I cannot change, so I just need to parse it.

[Image: screenshot of the spreadsheet showing the merged cells]

As you can see, the sheet has some merged cells. I want to fill the empty values that belong to merged cells (like "ffill") but leave the genuinely empty cells as they are.

Something like this:

EIM, C,NI1 Enescu_Ioan, EIM, S,NI11,Enescu_Ioan
EIM, C,NI1 Enescu_Ioan, Empty
EIM, C,NI1 Enescu_Ioan EIM, S,NI11,Enescu_Ioan
EIM, C,NI1,Enescu_Ioan Empty

This is how I'm loading the file right now:

xl = pd.ExcelFile("data/file.xls")
df = xl.parse(0, header=None)

I've also tried opening the file with xlrd and accessing the merged cells, but I get an empty list:

book = xlrd.open_workbook("data/file.xls")
book.sheet_by_index(0).merged_cells # This is empty []

Is there any way I could achieve this? Thanks!

EDIT

There might be some confusion about the question, so I'll try to explain it better. The attached image is a subset of a larger file in which the columns may appear in a different order. What I'm trying to achieve is a way to distinguish between NaN values that come from merged cells (in a merged range only the first cell has a value, the rest are all NaN) and NaN values that come from genuinely empty cells.

Upvotes: 4

Views: 6073

Answers (3)

LaserJesus

Reputation: 8550

I updated @radulescu-petru's solution to work with openpyxl:

import pandas as pd
import openpyxl
from openpyxl.utils.cell import range_boundaries

def read_excel(path):
    return openpyxl.load_workbook(path)

def parse_excel(path, excel_file, sheet_name):
    sheet = excel_file[sheet_name]
    # header=None so the first sheet row is treated as data, not column names
    df = pd.read_excel(path, sheet_name=sheet_name, header=None)
    return sheet, df

def fill_merged_na(sheet, dataframe):
    for merged_cell_range in sheet.merged_cells.ranges:
        # Bounds are 1-based and inclusive: (min_col, min_row, max_col, max_row)
        min_col, min_row, max_col, max_row = range_boundaries(str(merged_cell_range))
        # Get the value from the top-left cell of the merged range
        base_value = dataframe.iat[min_row-1, min_col-1]
        dataframe.iloc[min_row-1:max_row, min_col-1:max_col] = base_value
    return dataframe

path = 'some_file.xlsx'
excel_file = read_excel(path)
sheet, df = parse_excel(path, excel_file, 'sheet A')
df_filled = fill_merged_na(sheet, df)
df_filled.head()
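
To try the idea without an Excel file on disk, here is a minimal sketch that builds a small workbook in memory. The helper name fill_merged and the sample cell layout are assumptions made for illustration; it reads each range's bounds from the bounds attribute of openpyxl's CellRange instead of parsing the range string.

```python
import pandas as pd
from openpyxl import Workbook

def fill_merged(ws, df):
    """Copy each merged range's top-left value into every cell of that range."""
    out = df.copy()
    for cell_range in ws.merged_cells.ranges:
        # bounds is 1-based and inclusive: (min_col, min_row, max_col, max_row)
        min_col, min_row, max_col, max_row = cell_range.bounds
        value = ws.cell(row=min_row, column=min_col).value
        out.iloc[min_row - 1:max_row, min_col - 1:max_col] = value
    return out

# Hypothetical sheet: column A is one merged block, column B has real gaps
wb = Workbook()
ws = wb.active
ws["A1"] = "EIM, C,NI1 Enescu_Ioan"
ws["B1"] = "EIM, S,NI11,Enescu_Ioan"
ws["B3"] = "EIM, S,NI11,Enescu_Ioan"
ws.merge_cells("A1:A4")

# Build the DataFrame straight from the sheet values (no header row)
df = pd.DataFrame(list(ws.values))
filled = fill_merged(ws, df)
print(filled)
```

Only the cells inside A1:A4 are filled; the empty cells in column B are not part of any merged range, so they stay empty.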

Upvotes: 1

Radulescu Petru

Reputation: 127

I managed to find a fix:

import pandas as pd
import xlrd

def read_excel(path):
    excel = None
    if path.endswith('xlsx'):
        # Note: xlrd 2.0 and later reads only .xls, so this branch needs xlrd < 2.0
        excel = pd.ExcelFile(xlrd.open_workbook(path), engine='xlrd')
    elif path.endswith('xls'):
        # formatting_info=True makes xlrd load formatting, including merged cells
        excel = pd.ExcelFile(xlrd.open_workbook(path, formatting_info=True), engine='xlrd')
    else:
        raise ValueError("Could not read this type of data")
    return excel

def parse_excel(excel_file):
    sheet_0 = excel_file.book.sheet_by_index(0)
    df = excel_file.parse(0, header=None)
    return sheet_0, df

def fill_merged_na(sheet, dataframe):
    for e in sheet.merged_cells:
        # Each entry is (row_lo, row_hi, col_lo, col_hi); upper bounds are exclusive
        rl, rh, cl, ch = e
        base_value = sheet.cell_value(rl, cl)
        dataframe.iloc[rl:rh, cl:ch] = base_value
    return dataframe

The important bits are opening the Excel file with formatting_info set to True, so that xlrd also reads formatting information such as merged cells, and the fill_merged_na function, which fills only the NaN values that belong to merged cells and leaves the genuinely empty cells as they were.
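
The filling step can be checked without an Excel file. The sketch below is an illustration under assumptions: the function name fill_merged_ranges and the hand-written merged_ranges list are hypothetical stand-ins for sheet.merged_cells, using the same (row_lo, row_hi, col_lo, col_hi) layout with exclusive upper bounds.

```python
import numpy as np
import pandas as pd

def fill_merged_ranges(df, merged_ranges):
    """Fill every merged range with the value of its top-left cell."""
    out = df.copy()
    for rlo, rhi, clo, chi in merged_ranges:
        # Upper bounds are exclusive, so they map directly onto iloc slices
        out.iloc[rlo:rhi, clo:chi] = out.iat[rlo, clo]
    return out

# Hypothetical data: column 0 is one merged block, column 1 has real gaps
df = pd.DataFrame([
    ["EIM, C,NI1 Enescu_Ioan", "EIM, S,NI11,Enescu_Ioan"],
    [np.nan, np.nan],
    [np.nan, "EIM, S,NI11,Enescu_Ioan"],
    [np.nan, np.nan],
])
merged_ranges = [(0, 4, 0, 1)]  # rows 0..3 of column 0 are merged

filled = fill_merged_ranges(df, merged_ranges)
print(filled)
```

Because the fill is driven by the list of merged ranges rather than by a column-wide forward fill, the NaN values in column 1 are left untouched.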

Upvotes: 6

ManojK

Reputation: 1640

This is what the DataFrame looks like if you do df = pd.read_excel('path'):

print(df)

                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1                     NaN                      NaN
2                     NaN  EIM, S,NI11,Enescu_Ioan
3                     NaN                      NaN

Now, forward-fill the value from the first entry of the merged cell (fillna(method='ffill') is deprecated in recent pandas versions, so ffill() is used here):

df['Col1'] = df['Col1'].ffill()

And use pandas.DataFrame.replace to replace the NaN values with blanks:

df['Col2'] = df['Col2'].replace(np.nan,'')

An example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1' : ['EIM, C,NI1 Enescu_Ioan',np.nan,np.nan,np.nan], 
                   'Col2' : ['EIM, S,NI11,Enescu_Ioan',np.nan,'EIM, S,NI11,Enescu_Ioan',np.nan]})

print(df)

                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1                     NaN                      NaN
2                     NaN  EIM, S,NI11,Enescu_Ioan
3                     NaN                      NaN

df['Col1'] = df['Col1'].ffill()
df['Col2'] = df['Col2'].replace(np.nan,'')
print(df)
                     Col1                     Col2
0  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
1  EIM, C,NI1 Enescu_Ioan                         
2  EIM, C,NI1 Enescu_Ioan  EIM, S,NI11,Enescu_Ioan
3  EIM, C,NI1 Enescu_Ioan                         
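
Note that this approach assumes you already know which columns contain merged cells. A short sketch of why that matters, using the Col2 values from the example above: a forward fill applied to a column whose NaN values are genuinely empty cells fills those as well.

```python
import numpy as np
import pandas as pd

# Col2 from the example: the NaN values here are real empty cells, not merged ones
col2 = pd.Series(['EIM, S,NI11,Enescu_Ioan', np.nan, 'EIM, S,NI11,Enescu_Ioan', np.nan])

# A column-wide forward fill cannot tell the two kinds of NaN apart
filled = col2.ffill()
print(filled.isna().sum())
```

If the set of merged columns is not known up front, the merged ranges have to come from the workbook itself.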

Upvotes: 1
