Reputation: 397
I am writing a function in pandas that can read excel files from a working directory. Each of the excel files consists of multiple sheets, however the corresponding sheets in each file has the same column names and the number of sheets in each file are the same as well.
I would like to have a function that can merge/append each sheet from the different files such that sheet1 from all the files are merged into a dataframe, sheet2 from all the files are merged as second dataframe and so on. In the end, I would like to know the number of dataframes created.
For this purpose, I wrote the following code:
fpath = "/path to files/"
from os import walk
df = pd.DataFrame()
f = []
xls = []
dff = []
mypath = fpath
for (dirpath, dirnames, filenames) in walk(mypath):
f.extend(filenames)
break
for i in range(0, len(f)):
f[i] = mypath+"/"+f[i]
xls.append(pd.ExcelFile(f[i]))
cout = 0
for fil in range(0, len(xls)):
for sh in range(0, len(xls)):
if(cout <= len(xls)):
df = df.append(pd.read_excel(xls[sh], fil))
dff.append(df)
cout = cout + 1
I introduced the cout variable to control that after every merging/appending sheet 1 from all the files, the loop should break otherwise all the sheets are merged into a single dataframe.
Problem: The problem is that the function stops after returning only one dataframe in which the first sheets are merged. If I remove the "cout" switch, then all the sheets are merged. Can anyone help me in fixing the function code so that it 1)merges/append the corresponding sheets from each files, 2) make dataframe from (1), and return the dataframes? That way I will have a dataframe for each of the merged/appended sheet.
Can anyone help, Please?
Note: I am doing it in pandas but kindly suggest if you think there are better alternatives in R or any other programming language.
Upvotes: 0
Views: 126
Reputation: 18916
Ok, I looked through your code and I might have an answer for you without looping so much. Maybe it helps, maybe not.
As you point to one folder let us use listdir
instead. Use pd.ExcelFile
once to get the sheet names and then loop through all the sheet names and pd.concat
the different excel-files for each specific sheet_name.
import pandas as pd
import os
# Preparation
p = 'exceltest' #<-- folder name
files = [os.path.join(p,i) for i in os.listdir(p) if i.endswith('.xlsx')]
sheets = pd.ExcelFile(files[0]).sheet_names
# Dictionary holding the sheet_names as keys
dfs = {s: pd.concat(pd.read_excel(f, sheet_name=s) for f in files) for s in sheets}
# Only for demo purpose
print(dfs[sheets[0]])
In my example files (named Workbook1, Workbook2) with sheet_names (Sheet 1, Sheet 2) and (Matrix A,B rowbreak 1,2) this prints:
A B
0 1 2
0 1 2
Upvotes: 2