Reputation: 7644
I have 4 excel files - 'a1.xlsx','a2.xlsx','a3.xlsx','a4.xlsx' The format of the files are same
for eg a1.xlsx looks like:
id code name
1 100 abc
2 200 zxc
... ... ...
i have to read this files in pandas dataframe and check whether the same value of code
column exists in multiple excel files or not.
something like this.
if code=100
exists in 'a1.xlsx','a3.xlsx'
, and code=200
exists only in 'a1.xlsx'
final dataframe should look like:
code filename
100 a1.xlsx,a3.xlsx
200 a1.xlsx
... ....
and so on
I have all the files in a directory and tried to iterate them through loop
import pandas as pd
import os
x = next(os.walk('path/to/files/'))[2] #list all files in directory
os.chdir('path/to/files/')
for i in range (0,len(x)):
df = pd.read_excel(x[i])
How to proceed? any leads?
Upvotes: 1
Views: 705
Reputation: 862406
Use:
import glob
#get all filenames
files = glob.glob('path/to/files/*.xlsx')
#list comprehension with assign new column for filenames
dfs = [pd.read_excel(fp).assign(filename=os.path.basename(fp).split('.')[0]) for fp in files]
#one big df from list of dfs
df = pd.concat(dfs, ignore_index=True)
#join all same codes
df1 = df.groupby('code')['filename'].apply(', '.join).reset_index()
Upvotes: 3