Reputation: 20302
I am trying to loop through all CSV files in a folder and find all header names that are in all files. I am thinking the code would start like this...it needs treatment and enhancement, for sure.
import glob
import pandas as pd
csvs = glob.glob('C:\\my_path\\' + '*.csv')
master_set = set()
for file in csvs:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    master_set = master_set.intersection(cols)
print(master_set)
This is just looping through files in a folder, obviously. What I want to do is compare all CSV headers in one folder, and check for the matches (intersection) of all headers, and print that result. Does it make sense? I hope so. I will need to do a UNION of all these files at some point. I am trying to determine the best way to get all common headers together. This is the lowest common denominator of the whole data series.
So, if I have 4 files with this schema:
colA colB colC colD colE
And, I have one file with this schema:
colA colC colE colX colX
Then, this is what I want to see:
colA colC colE
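For the later UNION step, I imagine something along these lines would work once the common headers are known (the column list here is just hardcoded from the example above, and the path is the same placeholder as before):
import glob
import pandas as pd

csvs = glob.glob('C:\\my_path\\' + '*.csv')

# hardcoded from the example above; in practice this would be the computed intersection
common = ['colA', 'colC', 'colE']

# read only the shared columns from each file and stack the rows into one DataFrame
combined = pd.concat(
    (pd.read_csv(f, usecols=common) for f in csvs),
    ignore_index=True,
)
print(combined.head())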
Upvotes: 0
Views: 370
Reputation: 858
I solved the problem in the following way. The code works for any number of CSV files.
import pandas as pd
import csv
import glob

path = r'PATH_TO_CSV_DIR'  # use your path
all_files = glob.glob(path + "/*.csv")

df1 = pd.read_csv('PATH_TO_CSV_DIR/c15.csv')  # read one file to initialize the master_set
master_set = set(df1.columns)

for file in all_files:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    master_set = master_set.intersection(cols)

# write the common headers to a CSV file; they can be reused later for other operations
with open("CommonColumns.csv", "w", newline="") as f:
    csv.writer(f).writerow(master_set)

print(master_set)
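If you would rather not hardcode the seed file (c15.csv above), a possible variation is to seed the set from the first path that glob returns; a minimal sketch, assuming the same directory layout:
import glob
import pandas as pd

all_files = glob.glob(r'PATH_TO_CSV_DIR' + "/*.csv")

# seed from whichever file comes first, then narrow the set down file by file
master_set = set(pd.read_csv(all_files[0]).columns)
for file in all_files[1:]:
    master_set = master_set.intersection(pd.read_csv(file).columns)

print(master_set)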
Upvotes: 0
Reputation: 734
You seem to be looking to:
1. grab the paths of all of the CSV files in a folder,
2. read the header of each one, and
3. find the columns that all of the files have in common.
The answer above will work for parts 2 and 3. For part 1, I would recommend something like the following to grab paths to all of the relevant files. Unless you have a fancier pattern to match, good old glob is best.
import glob
csvs = glob.glob('PATH_TO_CSV_DIR' + '/*.csv')
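(If you prefer the standard library's pathlib, an equivalent sketch would be the following; pandas accepts Path objects directly, so the loop below works unchanged.)
from pathlib import Path

# equivalent to the glob call above, but yields Path objects instead of strings
csvs = list(Path('PATH_TO_CSV_DIR').glob('*.csv'))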
I would then create a master set against which you can compare the header of each new .csv:
import pandas as pd
master_set = None
for file in csvs:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    # seed with the first file's headers, then keep only those present in every file
    master_set = cols if master_set is None else master_set.intersection(cols)
Something like that should leave you with only the headers that appear in every file.
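Since only the headers are needed at this stage, one small optimization (assuming every file has a header row) is to pass nrows=0 so pandas parses just the header instead of loading each whole file:
import glob
import pandas as pd

csvs = glob.glob('PATH_TO_CSV_DIR' + '/*.csv')  # same pattern as above

master_set = None
for file in csvs:
    # nrows=0 parses only the header row, so large files are never loaded into memory
    cols = set(pd.read_csv(file, nrows=0).columns)
    master_set = cols if master_set is None else master_set.intersection(cols)
print(master_set)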
Upvotes: 2
Reputation: 460
Yes, you can do it, but it would require you to loop over the list of files and store the results as you go. As a sample, here is the code for two files.
import pandas as pd
df1 = pd.read_csv("File1.csv")
df2 = pd.read_csv("File2.csv")
setA = set(df1.columns)
setB = set(df2.columns)
common = setA.intersection(setB)
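To extend this from two files to a whole folder, one sketch (assuming the paths are collected with glob as in the other answers) is to fold the intersection over the list of files:
import glob
from functools import reduce

import pandas as pd

files = glob.glob('PATH_TO_CSV_DIR' + '/*.csv')

# one set of column names per file, intersected pairwise across the whole list
column_sets = (set(pd.read_csv(f).columns) for f in files)
common = reduce(set.intersection, column_sets)
print(common)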
Upvotes: 2