Merge excel files with multiple rows of headers in Python

Question

I have multiple excel files as follows (you may download test data from here) which have 3 row as headers in which feature_row are unique which could be used as column names of dataframes.

I need to merge them based on date and keeping multiple headers.

df1:

        country           jp           jp
0  feature_name      stock_a      stock_b
1   update_date   2021-09-15   2021-09-15
2      2006-6-2     100.0000     100.0000
3      2006-6-9      99.7800     100.2300
4     2006-6-16      97.7100      99.6300
5     2006-6-23      98.1100      98.4900
6     2006-6-30      97.5800      98.5700

df2:

        country           us           us
0  feature_name     cpi_food     core_cpi
1   update_date   2021-09-24   2021-09-24
2    1999-12-30   1,000.0000   1,000.0000
3      2000-1-4   1,025.7899   1,020.5759
4      2000-1-5   1,032.8712   1,024.9281
5      2000-1-6   1,073.6569   1,050.7501
6      2000-1-7   1,113.7831   1,081.9492
7     2000-1-10   1,135.3109   1,102.0936
8     2000-1-11   1,102.9618   1,037.0133

df3:

        country           cn           cn
0  feature_name    iCPI_food    iCPI_rent
1   update_date   2021-09-27   2021-09-27
2      2016-1-1      99.9712      99.9000
3      2016-1-2      99.9295      99.7684
4      2016-1-3     100.0744     100.1282
5      2016-1-4      99.8702      99.5830
6      2016-1-5      99.9254      99.8024

I'm able to loop excel files and merge them to one with code below, but the first and the third row of headers are missed.

import pandas as pd
import numpy as np
import glob

dfs = pd.DataFrame()

for file in glob.glob('./*.xlsx'):
    df = pd.read_excel(file, skiprows=[0, 2])
    df = df.rename(columns={'country': 'date'})
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d').astype(str)
    df = df.sort_values(by='date')
    print(df)
    if dfs.empty:
        dfs = df.copy()
    else:
        dfs = dfs.merge(df, on='date', how='outer')
    dfs = dfs.sort_values(by='date')
    dfs.to_excel('./data.xlsx')

How could I get an expected excel file like this one:

Updated result with the updated code of @jezrael:

jezrael · Accepted Answer

You can avoid remove MultiIndex, convert first column to DatetimeIndex and append to list of DataFrames, for join use concat by DatetimeIndex with default outer join:

dfs, cols = [], []
for file in glob.glob('./*.xlsx'):
    df = pd.read_excel(file)

    first = df.columns[0]
    df = df.set_index(first)
    df.index = pd.to_datetime(df.index)
    df = df.sort_index()
    dfs.append(df)
    cols.extend(df.columns)

df_fin = pd.concat(dfs).reindex(cols, axis=1).reset_index()
df_fin.to_excel('./data.xlsx', merge_cells=False)

Another option:

dfs = []
for file in glob.glob('./*.xlsx'):
    print(file)
    df = pd.read_excel(file, header=[0, 1, 2], skipfooter=2)

    first = df.columns[0]
    df = df.set_index(first)
    df.index = pd.to_datetime(df.index, format='%Y-%m-%d').astype(str)
    df = df.sort_index()
    dfs.append(df)

df_fin = pd.concat(dfs).reset_index()
df_fin = df_fin.dropna(how='all').sort_values(df_fin.columns[0])
df_fin.to_excel('./data.xlsx')

Merge excel files with multiple rows of headers in Python

Answers (2)

Related Questions