Reputation: 3
I am trying to extract the data from xlsx file from the multiple folders and also to get the folder name for each row to identify where the is extracted from. I am able to extract the data from all the folders however i am unable to get the folder names to the dataframe. Please help.
Folder structure -
Month-Year - 2020-02 Day folder - 2020-02-01
Under the day folder consist of xlsx files.
paths = []
arr = []
for root, dirs, files in os.walk(Full_Path):
for file in files:
if file.endswith(".xlsx"):
#print(os.path.join(root, file))
ab = os.path.join(root, file)
print(ab)
arr.append(paths)
paths.append(ab)
for lm in arr:
print(lm)
all_data = pd.DataFrame()
for f in paths:
df = pd.read_excel(f, sheet_name='In Queue', usecols = fields)
df['Date'] = lm
all_data = all_data.append(df,ignore_index=True)
I have also tried different ways but not getting the output.
Upvotes: 0
Views: 128
Reputation: 107567
Consider building a list of data frames to be concatenated once outside the loop. Also, use assign
to generate the Date column during the loop:
df_list = []
for root, dirs, files in os.walk(Full_Path):
for file in files:
if file.endswith((".xls", ".xlsx", ".xlsb", ".xlsm", ".odf")):
xl_file = os.path.join(root, file)
df = (pd.read_excel(xl_file, sheet_name='In Queue', usecols = fields)
.assign(Date = xl_file))
df_list.append(df)
final_df = pd.concat(df_list, ignore_index=True)
Upvotes: 1