Reputation: 1114
I am using pandas to load thousands of CSVs. However, I am only interested in some columns, which might not be present in all CSVs.
It appears that the usecols argument doesn't work if a column name specified there doesn't exist in one of the CSVs. What's the best workaround for this? Thanks
import pandas as pd

nrFiles = 0
for fullPath in listFilenamesPath:
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Upvotes: 2
Views: 3710
Reputation: 2017
This is a bit late, but the usecols parameter can be a callable function. To quote the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
"If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True."
check_cols = ['name', 'hostname', 'application family']
df = pd.read_csv(
    fullPath,
    sep=";",
    usecols=lambda x: x in check_cols,
)
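Applied to the loop from the question (a sketch, reusing listFilenamesPath and the column list from there), files that lack some of these columns then load without raising a ValueError:
import pandas as pd

check_cols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    # the lambda keeps whichever of the wanted columns exist in this file
    df = pd.read_csv(fullPath, sep=";", usecols=lambda x: x in check_cols)
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")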
Upvotes: 4
Reputation: 150
It appears read_csv throws a ValueError when it can't find a column specified in the usecols parameter. You could either use a try/except block and skip the files which raise the error:
for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        pass
or catch the error, parse the conflicting column names out of the message, and retry with a subset. There is probably a cleaner way to do this.
import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']

for fullPath in listFilenamesPath:
    usecols_ = usecols
    # if usecols_ ends up empty, none of the wanted columns were found in this file
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the ValueError message lists the missing columns in brackets;
            # strip the quotes per entry rather than deleting all spaces,
            # since names like 'application family' contain a space
            r = re.search(r"\[(.+)\]", str(e))
            missing_cols = [c.strip().strip("'") for c in r.group(1).split(",")]
            usecols_ = [x for x in usecols_ if x not in missing_cols]
    # rest of your code
Upvotes: 3
Reputation: 6669
A workaround could be to get the column names that appear both in your usecols list (the list of columns you want to look for) and in df.columns. You can then use this list of common column names to subset your df.
The code with necessary comments:
import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both the usecols list and df.columns
    final_list = list(set(usecols) & set(df.columns))
    ### subset it using final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Here is a df read from an example CSV, test1.csv:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I want to look for the columns:
usecols = ['A', 'D', 'B']
I read the entire CSV, get the common columns between the df and the columns I am looking for (in this case A and B), and subset it as follows:
df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)
Output:
B A
0 4 1
1 5 2
2 6 3
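Note that the set intersection does not preserve the original column order (B comes before A in the output above). If the order matters, a list comprehension over usecols keeps it; a small sketch:
### keep only the columns present in the file, in the order given in usecols
final_list = [c for c in usecols if c in df.columns]
df = df[final_list]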
Upvotes: 5
Reputation: 412
You could read in the entire CSV without usecols. This will allow you to check which columns the DataFrame has. If the DataFrame does not have the desired columns, you can ignore it or process it however you need.
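A minimal sketch of that idea, reusing usecols and listFilenamesPath from the question:
import pandas as pd

usecols = ['name', 'hostname', 'application family']

for fullPath in listFilenamesPath:
    df = pd.read_csv(fullPath, sep=";")
    if not set(usecols).issubset(df.columns):
        continue  # ignore files missing any of the desired columns
    df = df[usecols]
    # ... process df as needed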
Upvotes: 0