Reputation: 1114
I am using pandas to load thousands of CSVs. However, I am only interested in some columns, which might not be present in all CSVs.
It appears that the usecols argument doesn't work if a column name specified there doesn't exist in one of the CSVs. What's the best workaround for this? Thanks
import pandas as pd

nrFiles = 0
for fullPath in listFilenamesPath:
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Upvotes: 2
Views: 3710
Reputation: 2017
This is a bit late, but the usecols parameter can be a callable function. To quote the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
"If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True."
check_cols = ['name', 'hostname', 'application family']
df = pd.read_csv(
    fullPath,
    sep=";",
    usecols=lambda x: x in check_cols,
)
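Applied to the loop from the question (a sketch, reusing listFilenamesPath and the column list from there), files that lack some of these columns then load without raising a ValueError:
import pandas as pd

check_cols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    # the lambda keeps whichever of the wanted columns exist in this file
    df = pd.read_csv(fullPath, sep=";", usecols=lambda x: x in check_cols)
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")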
Upvotes: 4
Reputation: 150
It appears read_csv throws a ValueError when it can't find a column specified in the usecols parameter. You could either use a try/except block and skip the files which raise the error:
for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        pass
or catch the error, parse the conflicting column names out of the message, and retry with a subset. There is probably a cleaner way to do this.
import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']

for fullPath in listFilenamesPath:
    usecols_ = usecols
    # if usecols_ ends up empty, none of the wanted columns were found in this file
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the ValueError message lists the missing columns in brackets;
            # strip the quotes per entry rather than deleting all spaces,
            # since names like 'application family' contain a space
            r = re.search(r"\[(.+)\]", str(e))
            missing_cols = [c.strip().strip("'") for c in r.group(1).split(",")]
            usecols_ = [x for x in usecols_ if x not in missing_cols]
    # rest of your code
Upvotes: 3
Reputation: 6669
A workaround could be to get the column names that appear both in your usecols list (the list of columns you want to look for) and in df.columns. You can then use this list of common column names to subset your df.
The code with necessary comments:
import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both the usecols list and df.columns
    final_list = list(set(usecols) & set(df.columns))
    ### subset it using final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Here is a df read from an example CSV, test1.csv:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I want to look for the columns:
usecols = ['A', 'D', 'B']
I read the entire CSV, get the common columns between the df and the columns I am looking for (in this case A and B), and subset it as follows:
df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)
Output:
B A
0 4 1
1 5 2
2 6 3
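Note that the set intersection does not preserve the original column order (B comes before A in the output above). If the order matters, a list comprehension over usecols keeps it; a small sketch:
### keep only the columns present in the file, in the order given in usecols
final_list = [c for c in usecols if c in df.columns]
df = df[final_list]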
Upvotes: 5
Reputation: 412
You could read in the entire CSV without usecols. This will allow you to check which columns the DataFrame has. If the DataFrame does not have the desired columns, you can ignore it or process it however you need.
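A minimal sketch of that idea, reusing usecols and listFilenamesPath from the question:
import pandas as pd

usecols = ['name', 'hostname', 'application family']

for fullPath in listFilenamesPath:
    df = pd.read_csv(fullPath, sep=";")
    if not set(usecols).issubset(df.columns):
        continue  # ignore files missing any of the desired columns
    df = df[usecols]
    # ... process df as needed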
Upvotes: 0