Reputation: 10011
I have multiple excel files as follows (you may download test data from here) which have 3 row as headers in which feature_row
are unique which could be used as column names of dataframes.
I need to merge them based on date
and keeping multiple headers.
df1:
country jp jp
0 feature_name stock_a stock_b
1 update_date 2021-09-15 2021-09-15
2 2006-6-2 100.0000 100.0000
3 2006-6-9 99.7800 100.2300
4 2006-6-16 97.7100 99.6300
5 2006-6-23 98.1100 98.4900
6 2006-6-30 97.5800 98.5700
df2:
country us us
0 feature_name cpi_food core_cpi
1 update_date 2021-09-24 2021-09-24
2 1999-12-30 1,000.0000 1,000.0000
3 2000-1-4 1,025.7899 1,020.5759
4 2000-1-5 1,032.8712 1,024.9281
5 2000-1-6 1,073.6569 1,050.7501
6 2000-1-7 1,113.7831 1,081.9492
7 2000-1-10 1,135.3109 1,102.0936
8 2000-1-11 1,102.9618 1,037.0133
df3:
country cn cn
0 feature_name iCPI_food iCPI_rent
1 update_date 2021-09-27 2021-09-27
2 2016-1-1 99.9712 99.9000
3 2016-1-2 99.9295 99.7684
4 2016-1-3 100.0744 100.1282
5 2016-1-4 99.8702 99.5830
6 2016-1-5 99.9254 99.8024
I'm able to loop excel files and merge them to one with code below, but the first and the third row of headers are missed.
import pandas as pd
import numpy as np
import glob
dfs = pd.DataFrame()
for file in glob.glob('./*.xlsx'):
df = pd.read_excel(file, skiprows=[0, 2])
df = df.rename(columns={'country': 'date'})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d').astype(str)
df = df.sort_values(by='date')
print(df)
if dfs.empty:
dfs = df.copy()
else:
dfs = dfs.merge(df, on='date', how='outer')
dfs = dfs.sort_values(by='date')
dfs.to_excel('./data.xlsx')
How could I get an expected excel file like this one:
Updated result with the updated code of @jezrael:
Upvotes: 1
Views: 1987
Reputation: 862611
You can avoid remove MultiIndex
, convert first column to DatetimeIndex
and append to list of DataFrames, for join use concat
by DatetimeIndex
with default outer join:
dfs, cols = [], []
for file in glob.glob('./*.xlsx'):
df = pd.read_excel(file)
first = df.columns[0]
df = df.set_index(first)
df.index = pd.to_datetime(df.index)
df = df.sort_index()
dfs.append(df)
cols.extend(df.columns)
df_fin = pd.concat(dfs).reindex(cols, axis=1).reset_index()
df_fin.to_excel('./data.xlsx', merge_cells=False)
Another option:
dfs = []
for file in glob.glob('./*.xlsx'):
print(file)
df = pd.read_excel(file, header=[0, 1, 2], skipfooter=2)
first = df.columns[0]
df = df.set_index(first)
df.index = pd.to_datetime(df.index, format='%Y-%m-%d').astype(str)
df = df.sort_index()
dfs.append(df)
df_fin = pd.concat(dfs).reset_index()
df_fin = df_fin.dropna(how='all').sort_values(df_fin.columns[0])
df_fin.to_excel('./data.xlsx')
Upvotes: 1
Reputation: 145
df1 = pd.read_excel('data1.xlsx',header=[0,1,2])
df2 = pd.read_excel('data2.xlsx',header=[0,1,2])
df3 = pd.read_excel('data3.xlsx',header=[0,1,2])
pd.concat([df1,df2,df3],ignore_index = True)
Upvotes: 0