RSM
RSM

Reputation: 673

Pandas: How to read xlsx files from a folder matching only specific names

I have a folder full of excel files and i have to read only 3 files from that folder and put them into individual dataframes.

File1: Asterix_New file_Jan2020.xlsx
File2: Asterix_Master file_Jan2020.xlsx
File3: Asterix_Mapping file_Jan2020.xlsx

I am aware of the below syntax which finds xlsx file from a folder but not sure how to relate it to specific keywords. In this case starting with "Asterix_"

files_xlsx = [f for f in files if f[-4:] == "xlsx"]

Also i am trying to put each of the excel file in a individual dataframe but not getting successful:

for i in files_xlsx:
    df[i] = pd.read_excel(files_xlsx[0])

Any suggestions are appreciated.

Upvotes: 0

Views: 3180

Answers (5)

sammywemmy
sammywemmy

Reputation: 28644

I suggest using pathlib. If all the files are in a folder:

from pathlib import Path
from fnmatch import fnmatch
folder = Path('name of folder')

Search for the files using glob. I will also suggest using fnmatch to include the files whose extensions are in capital letters.

iterdir allows you to iterate through the files in the folder

name is a method in pathlib that gives you the name of the file in string format

using the str lower method ensures that extensions such as XLSX, which is uppercase is captured

 excel_only_files = [xlsx for xlsx in folder.iterdir()
                     if fnmatch(xlsx.name.lower(),'asterix_*.xlsx')]

OR

#you'll have to test this, i did not put it though any tests
excel_only_files = list(folder.rglob('Asterix_*.[xlsx|XLSX]')

from there, you can run a list comprehension to read your files:

 dataframes = [pd.read_excel(f) for f in excel_only_files]

Upvotes: 3

rajah9
rajah9

Reputation: 12329

If you have read in the file names, you can make sure that it starts with and ends with the desired strings by using this list comprehension:

files = ['filea.txt', 'fileb.xlsx', 'filec.xlsx', 'notme.txt']
files_xlsx = [f for f in files if f.startswith('file') and f.endswith('xlsx')]
files_xlsx # ['fileb.xlsx', 'filec.xlsx']

The list comprehension says, "Give me all the files that start with file AND end with xlsx.

Upvotes: 0

Fourier
Fourier

Reputation: 2983

First generate a list of files you want to read in using glob (based on @cup's answer) and then append them to a list.

import pandas as pd
import glob

my_df_list = [pd.read_excel(f) for f in glob.iglob('Asterix_*.xlsx')]

Depending on what you want to achieve, you can also use a dict to allow for key-value pairs.

Upvotes: 1

schwab09
schwab09

Reputation: 51

At the end of the if statement you need to add another condition for files which also contain 'Asterix_':

files_xlsx = [f for f in files if f[-4:] == "xlsx" and "Asterix_" in f]

The f[-4:] == "xlsx" is to make sure the last 4 characters of the file name are xlsx and "Asterix_" in f makes sure that "Asterix_" exists anywhere in the file name.

To then read these using pandas try:

for file in excel_files:
    df = pd.read_excel(file)
    print(df)

That should print the result of the DataFrame read from the excel file

Upvotes: 0

cup
cup

Reputation: 8237

Use glob.glob to do your pattern matches

import glob

for i in glob.glob('Asterix_*.xlsx'):
    ...

Upvotes: 2

Related Questions