Reputation: 366
I have a folder. In this folder are 48 xlsx files, but the count of the relevant files are 22. Them name of these 22 files have no structure, the only thing in common is that the filenames start with data. I would love to access this files and read them all into a dataframe. Doing this manually with the code line
df = pd.read_excel(filename, engine='openpyxl')
takes too long
The table structure is similar but not always exactly the same. How can I manage to solve this problem
Upvotes: 1
Views: 304
Reputation: 727
import os
import pandas as pd
dfs = {}
def get_files(extension, location):
xlsx_list = []
for root, dirs, files in os.walk(location):
for t in files:
if t.endswith(extension):
xlsx_list.append(t)
return xlsx_list
file_list = get_files('.xlsx', '.')
index = 0
for filename in file_list:
index += 1
df = pd.read_excel(filename, engine='openpyxl')
dfs[filename] = df
print(dfs)
each element in dfs
like dfs['file_name_here.xlsx']
accesses the data frame output from the read_excel
.
EDIT: that you can add additional criteria to filter through the XLSX files at the line if t.endswith(extension):
you can check out the beginning of the file like if t.startswith('data'):
too. Maybe combine them if t.startswith('data') and t.endswith(extension):
Upvotes: 2