Chris M.

Reputation: 302

Appending blank rows to dataframe if column does not exist

This question is kind of odd and complex, so bear with me, please.

I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to select only the relevant columns. The issue is that not all of the CSV files contain all of the columns I need (a property of the data system being used).

The problem is that, if a column is listed in usecols but doesn't exist in the file, read_csv raises an error.
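For illustration, here's a minimal sketch of the failure (the file contents and the 'pressure' column are hypothetical stand-ins for my data):

import io
import pandas as pd

# A file that has 'time' and 'temp' but not 'pressure' (hypothetical data).
csv_data = io.StringIO("time,temp\n0,20.5\n1,21.0\n")

# Requesting a column the file lacks raises an error along the lines of:
# ValueError: Usecols do not match columns, columns expected but not found: ['pressure']
df = pd.read_csv(csv_data, usecols=['time', 'temp', 'pressure'])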

Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank values for any column that doesn't exist in the file? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

Upvotes: 1

Views: 1883

Answers (1)

pml

Reputation: 504

I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

Assuming some kind of master list all_cols_to_use, can you do something like:

import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    # Read just the header line to learn which columns this file has.
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Fill in the columns this file lacked, one at a time
    # (works for new columns, unlike df.loc[:, missing_cols] = np.nan,
    # which raises a KeyError for labels not already in the index).
    for col in missing_cols:
        df[col] = np.nan
    return df

This assumes that you're okay with filling the missing columns with np.nan, but it should work. (Also, if you're concatenating the data frames, the missing columns will be present in the final df and filled with np.nan as appropriate.)
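For example, if the files live in a directory of dumps (the path and master list here are hypothetical), something like:

import glob
import pandas as pd

all_cols_to_use = ['time', 'temp', 'pressure']  # hypothetical master list

frames = [parse_big_csv(path) for path in glob.glob('dumps/*.csv')]
# concat aligns on column names, so every file contributes the full column set
combined = pd.concat(frames, ignore_index=True)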

Upvotes: 2
