Reputation: 11
Hi everyone, I would like to merge 2000 CSV files, one from each of 2000 sub-folders, into a single file. Each sub-folder contains three CSV files with different names, so I need to select only one CSV from each folder.
I know how to merge a bunch of CSV files if they are all in the same folder:
import pandas as pd
import glob
path = r'Total_csvs'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv',index=False)
But my problem with the 2000 CSV files looks quite different.
The folder structure is: one main folder containing 2000 subfolders. Each subfolder holds multiple CSV files, and I need to select only one of them. Finally, I want to concatenate all 2000 selected CSV files.
As for naming conventions: all the subfolders have different names, but the CSV I need in each subfolder has the same name as the subfolder.
Any suggestions or sample code (how to read 2000 CSV files from sub-folders) would be helpful.
Thanks in advance
Upvotes: 0
Views: 1850
Reputation: 36560
If you are using Python 3.5 or newer, you can use glob.glob recursively, as follows:
import glob
path = r'Total_csvs'
all_csv = glob.glob(path+"/**/*.csv",recursive=True)
Now all_csv is a list of relative paths to every *.csv file inside Total_csvs, its subdirectories, their subdirectories, and so on.
For example, let's assume that all_csv is now:
all_csv = ['Total_csvs/abc/abc.csv','Total_csvs/abc/another.csv']
We need to keep only the files whose name corresponds to the directory they reside in. This can be done as follows:
import os
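def check(x):
    # Normalize separators, then compare the parent directory name with the file name
    directory, filename = os.path.normpath(x).split(os.path.sep)[-2:]
    return directory + '.csv' == filename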
all_csv = [i for i in all_csv if check(i)]
print(all_csv) #prints ['Total_csvs/abc/abc.csv']
Now all_csv is a list of paths to all the .csv files you are looking for, and you can use it the same way you used all_files in the "flat" (non-recursive) case.
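Putting the two steps together, here is a minimal sketch of the selection as one function. The name matching_csvs is hypothetical (not from the code above), and it assumes the wanted file is always named exactly like its parent folder plus ".csv":

```python
import glob
import os


def matching_csvs(root):
    """Return sorted paths of CSV files named after their parent folder.

    Hypothetical helper: assumes the layout <root>/<name>/<name>.csv.
    """
    pattern = os.path.join(root, "**", "*.csv")
    selected = []
    for path in glob.glob(pattern, recursive=True):
        # Name of the folder the file lives in, e.g. "abc" for .../abc/abc.csv
        folder = os.path.basename(os.path.dirname(os.path.abspath(path)))
        if os.path.basename(path) == folder + ".csv":
            selected.append(path)
    return sorted(selected)
```

The returned list can then be looped over with pd.read_csv and combined with pd.concat, just like all_files in the question.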
Upvotes: 1
Reputation: 2407
You can do it without joining paths:
import pathlib
import pandas as pd
lastparent = None
for ff in pathlib.Path("Total_csvs").rglob("*.csv"):  # recursive glob
    print(ff)
    if ff.parent != lastparent:  # process the 1st file in the dir
        lastparent = ff.parent
        df = pd.read_csv(str(ff),... )
        ...etc.
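Note that the loop above processes whichever file rglob happens to yield first in each directory, which is not necessarily the one named after the folder. A possible variant, assuming the wanted file's stem equals its parent folder's name (the function name select_by_folder_name is made up for this sketch):

```python
from pathlib import Path


def select_by_folder_name(root):
    """Return sorted Paths of CSV files whose stem equals their parent folder name.

    Hypothetical helper: assumes the layout <root>/<name>/<name>.csv.
    """
    return sorted(
        p for p in Path(root).rglob("*.csv")
        if p.stem == p.parent.name
    )
```

Each returned Path can be passed to pd.read_csv directly.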
Upvotes: 0
Reputation: 3399
We can iterate over every subfolder, determine expected_csv_path, and check whether it exists. If it does, we read it and append the resulting DataFrame to our li list.
Try the following:
import pandas as pd
import os
path = r'Total_csvs'
li = []
for f in os.listdir(path):
    expected_csv_path = os.path.join(path, f, f + '.csv')
    csv_exists = os.path.isfile(expected_csv_path)
    if csv_exists:
        df = pd.read_csv(expected_csv_path, index_col=None, header=0)
        li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True, sort=False)
frame.to_csv('Total.csv',index=False)
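With 2000 subfolders, it may also be useful to know which ones were skipped because the expected file was absent. A rough sketch of that idea, where find_expected_csvs is a hypothetical helper and the <folder>/<folder>.csv layout is assumed:

```python
import os


def find_expected_csvs(root):
    """Split subfolders of root into found CSV paths and names of folders lacking one.

    Hypothetical helper: assumes the layout <root>/<name>/<name>.csv.
    """
    found, missing = [], []
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if not entry.is_dir():
            continue  # ignore stray files sitting directly in root
        expected = os.path.join(entry.path, entry.name + ".csv")
        if os.path.isfile(expected):
            found.append(expected)
        else:
            missing.append(entry.name)
    return found, missing
```

The found list can feed the same pd.read_csv loop, and missing can be printed or logged before concatenating.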
Upvotes: 1
Reputation: 125
You can loop through all the subfolders using os.listdir.
Since the CSV filename is the same as the subfolder name, simply use the subfolder name to construct the full path name.
import os
import pandas as pd
root = "Total_csvs"
folders = os.listdir(root)
li = []
for folder in folders:
    # Since they are the same name
    selected_csv = folder
    # Include the root folder so the path resolves from the working directory
    filename = os.path.join(root, folder, selected_csv + ".csv")
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv',index=False)
Upvotes: 1