Reputation: 302
This question is kind of odd and complex, so bear with me, please.
I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to pull in only the relevant columns. The issue is that not all of the CSV files contain all of the columns I need (a property of the data system being used), and if a column specified in usecols doesn't exist in the file, read_csv throws an error.
Is there a straightforward way to force a specified column set on a dataframe and have pandas just return blank values for any column that doesn't exist in the file? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
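For reference, here is a stripped-down reproduction of the error; the column names and the in-memory "file" are made up for illustration:

import io
import pandas as pd

# A file that happens to lack the "pressure" channel.
small_file = io.StringIO("time,temp\n0,20.1\n1,20.3\n")

# Raises ValueError because "pressure" is listed in usecols but is not in
# the file, e.g. "Usecols do not match columns, columns expected but not found: ..."
pd.read_csv(small_file, usecols=['time', 'temp', 'pressure'])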
Upvotes: 1
Views: 1883
Reputation: 504
"I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy."
Assuming some kind of master list all_cols_to_use, can you do something like:
import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    with open(csvpath, 'r') as infile:
        # Peek at the header row to see which columns this file actually contains.
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add the columns this file lacks, filled with NaN; assigning one at a
    # time works across pandas versions (list-based .loc can raise KeyError).
    for col in missing_cols:
        df[col] = np.nan
    return df
This assumes that you're okay with filling the missing columns with np.nan, but should work. (Also, if you're concatenating the data frames, the missing columns will be in the final df and filled with np.nan as appropriate.)
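A quick sketch of that last point, with made-up file names and a made-up master list (it assumes the parse_big_csv function and imports above):

all_cols_to_use = ['time', 'temp', 'pressure']

# Each file may lack some of the master columns, but every parsed frame
# ends up with the full column set, so concat lines the columns up by name
# and leaves NaN where a file had no data.
frames = [parse_big_csv(path) for path in ['run1.csv', 'run2.csv']]
combined = pd.concat(frames, ignore_index=True)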
Upvotes: 2