Reputation: 20302

Is there some way to find the intersect in multiple file names between multiple CSV headers?

I am trying to loop through all CSV files in a folder and find all header names that are in all files. I am thinking the code would start like this...it needs treatment and enhancement, for sure.

import glob
import pandas as pd

csvs = glob.glob('C:\\my_path\' + '*.csv')

master_set = set()

for file in csvs:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    master_set = master_set.intersection(cols)

print(master_set)

This is just looping through files in a folder, obviously. What I want to do is compare all CSV headers in one folder, and check for the matches (intersection) of all headers, and print that result. Does it make sense? I hope so. I will need to do a UNION of all these files at some point. I am trying to determine the best way to get all common headers together. This is the lowest common denominator of the whole data series.

So, if I have 4 files with this schema:

colA colB colC colD colE

And, I have one file with this schema:

colA colC colE colX colX

Then, this is want I to see:

colA colC colE

Upvotes: 0

Answers (3)

santobedi

Reputation: 858

I solved the problem in the following way. The code works for any number of CSV files.

import pandas as pd
import csv
import glob

path = r'PATH_TO_CSV_DIR' # use your path
all_files = glob.glob(path + "/*.csv")
df1 = pd.read_csv('PATH_TO_CSV_DIR/c15.csv') # to initialize the master_set
master_set = set(df1.columns)

for file in all_files:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    master_set = master_set.intersection(cols)

w = csv.writer(open("CommonColumns.csv","w")) #it gives a CSV file of common header. The common header can be used later for other operations. 
w.writerow(master_set)
print(master_set)

Upvotes: 0

mayosten

Reputation: 734

You seem to be looking to:

Identify the .csv files
Grab the headers
Intersect the headers

The answer above will work for parts 2 and 3. For part 1, I would recommend something like the following to grab paths to all of the relevant files. Unless you have a fancier pattern to match, good old glob is best.

import glob

csvs = glob.glob('PATH_TO_CSV_DIR' + '*.csv')

I would then create a master set against which you can compare the header of each new .csv:

import pandas as pd

master_set = set()

for file in csvs:
    this_df = pd.read_csv(file)
    cols = set(this_df.columns)
    master_set = master_set.intersection(cols)

Something like that should append all the unique headers.

Upvotes: 2

Pavan Kumar Polavarapu

Reputation: 460

Yes, you can do it but would require you to loop in a loop on list of files and store the results. As far as a sample, here is the code.

import pandas as pd
df1 = pd.read_csv("File1.csv")
df2 = pd.read_csv("File2.csv")
setA = set(df1.columns)
setB = set(df2.columns)
common = setA.intersection(setB)

Upvotes: 2

Is there some way to find the intersect in multiple file names between multiple CSV headers?

Answers (3)

Related Questions