Leo
Leo

Reputation: 89

Import multiple excel sheets from different files into python and concatenate them into one dataframe

I am trying to combine multiple sheets from different excel files into one dataframe. All the files have multiple sheets, and one of these sheets has the same name in all the files - this is the sheet that I am interested in combining into one dataframe. All the files are in the same directory.

import pandas as pd
import os, glob
os.chdir(r'c:\Users\Documents\Files')

def files():                                 #to select the files that have RMP and WE on their name
    list_files= pd.Series()
    for file in glob.glob('RMP*WE*'):
        data= pd.Series(file)
        list_files= list_files.append(data, ignore_index=True)
    return list_files
a= files()
print("This is the variable a\n", a)

def extract_tab():                       #to concatenate the sheet called Metrics that all files have
    frame_files= pd.DataFrame()
    try:
        for file in a:
            data= pd.read_excel(file,sheet_name='Metrics')
            frame_files= frame_files.append(data, ignore_index=True)
    except:
        pass
    return  frame_files

b= extract_tab()

print("This is b\n",b)

The result of variable a (files function) is a list of the files that meet the naming criteria. But the result of variable b (extract_tab function) is an empty dataframe. What am I doing wrong?

I looked at this postImport multiple excel files into python pandas and concatenate them into one dataframe but it doesn't work... although I did take a couple of ideas from it.

Upvotes: 2

Views: 1986

Answers (2)

Umar.H
Umar.H

Reputation: 23099

IIUC, you can just do a list comp over your directory.

If using Python 3.4 +

from Pathlib import Path
path_ = 'c:\Users\Documents\Files'

dfs = [pd.read_excel(f,sheet_name='metrics') for f in Path(path_).glob('RMP*WE*')]

df = pd.concat(dfs)

or if you can only use the os module :

os.chdir('c:\Users\Documents\Files')
files = glob.glob('RMP*WE*')
dfs = [pd.read_excel(f,sheet_name='metrics') for f in files]
df = pd.concat(dfs)

Update.

If you need to handle missing sheets, this would be a nice way to do so.

def exlude_sheet(excel_list, sheet):
    """
    takes two arguments:
    1. A list of excel documents
    2. The name of your sheet.
    3. Returns a single data frame after 
    working through your list of excel objects. 
    """
    from xlrd import XLRDError
    df_lists = []
    for file in excel_list:
        try:
            file_df = pd.read_excel(file, sheet_name=sheet)
            df_lists.append(file_df)
        except (XLRDError) as e:
            print(f"{e} skipping")
            continue
    try:
        return pd.concat(df_lists)
    except ValueError as err:
        print("No Objects Matched")

Test.

xlsx = [f for f in Path(path_).glob('RMP*WE*')]
df = exlude_sheet(xlsx,sheet='Metrics')
out:
No sheet named <'Metrics'> for doc_1 skipping
No sheet named <'Metrics'> for doc_final skipping
print(df)
      Column_A  data
0           0     0
1           1     1
2           2     2
3           3     3
4           4     4

Test 2

Testing when no matching sheets are found at all :

exlude_sheet(xlsx,'foobar')
No sheet named <'foobar'> skipping
No sheet named <'foobar'> skipping
No sheet named <'foobar'> skipping
No Objects Matched

Upvotes: 4

Kenan
Kenan

Reputation: 14094

Try

import pandas as pd
from glob import glob

os.chdir('c:\Users\Documents\Files')
df = pd.concat([pd.read_excel(i, sheet_name='Metrics') for i in glob('RMP*WE*')])

Upvotes: 1

Related Questions