CaffeinatedMike

Reputation: 1607

How can I use pandas/dask to normalize columns across 12k CSV files (50 GB) in a Google Cloud Storage bucket in a memory-friendly way?

I have about 12,000 CSV files (50 GB) stored in a Google Cloud Storage bucket with nearly identical column structures, though there are some differences: some files have extra rows before the header row, and some are missing columns that other files have.

Overview of the goal:

  1. Gather an extensive list of column headers from all files
  2. Use that extensive list of column headers to normalize all files so they have matching schemas (a rough per-file sketch follows this list):
    • removing any rows that precede the header row
    • adding any missing columns to each file
    • adding the filename and google storage folder path as static columns to each file
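
To make the goal concrete, a minimal per-file sketch of step 2 might look like the following (not tested; it assumes gcsfs is installed so pandas can read gs:// paths directly, and MASTER_COLUMNS, normalize_file and header_row are hypothetical names for the full column list, the per-file routine and the detected header position):

import pandas as pd

# hypothetical master list built in step 1 (the union of every header seen)
MASTER_COLUMNS = ['col_a', 'col_b', 'col_c']

def normalize_file(gcs_path, header_row):
    # skiprows drops any junk rows that precede the real header row
    df = pd.read_csv(gcs_path, skiprows=header_row, dtype=str)
    # add any columns the file is missing (filled with NaN) and fix the order
    df = df.reindex(columns=MASTER_COLUMNS)
    # record where the data came from as static columns
    folder, _, filename = gcs_path.rpartition('/')
    df['source_folder'] = folder
    df['source_file'] = filename
    return df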

Pseudo-Code Thought Process

Precursor: I'm a bit lost as to how to approach this, as I'm not very savvy with pandas/numpy. But I do have a general idea of the steps I'd need to take to accomplish this.

  1. Utilize pandas.read_csv(..., nrows=25) to peek at each file's contents
  2. Maybe compare each row's populated cell count against the file's widest row to determine which row holds the header columns? (a sketch follows this list)
  3. Always assign str to every single column dtype (to keep memory usage down)
  4. ?
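
A rough sketch of the header-detection idea in step 2 (not tested; gcsfs is an assumption here, and the "first row with the maximum field count" heuristic may need tuning for your data):

import csv
import gcsfs

fs = gcsfs.GCSFileSystem()

def find_header_row(gcs_path, peek_rows=25):
    # peek at only the first few rows, so memory usage stays minimal
    with fs.open(gcs_path, 'rt', newline='') as f:
        rows = [row for _, row in zip(range(peek_rows), csv.reader(f))]
    # assume the header row is the first row with the maximum field count;
    # rows that precede the header usually have fewer populated cells
    widest = max(len(r) for r in rows)
    return next(i for i, r in enumerate(rows) if len(r) == widest)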

Note: The machine used is limited to about 16 GB of RAM, and downloading the entire folder locally is not viable due to storage limitations.

Once the files are normalized I can handle creating a BigQuery LoadJob to ingest the file data.
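
For reference, that load step could look roughly like this (a sketch using the google-cloud-bigquery client; the bucket prefix, project, dataset and table names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # the normalized files all have a single header row
    autodetect=True,       # or pass an explicit schema built from the master column list
    write_disposition='WRITE_APPEND',
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/normalized/*.csv',
    'my-project.my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # wait for the job to finish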

Upvotes: 1

Views: 252

Answers (1)

SultanOrazbayev

Reputation: 16581

If all files' headers started on the first row, I would have used something like this (not tested):

import pandas as pd
from glob import glob
from dask import delayed, compute

@delayed
def get_cols(file):
    # read only the header row, so very little data is pulled per file
    cols = pd.read_csv(file, nrows=1).columns.tolist()
    return {'file': file, 'cols': cols}

files = glob('*.csv')  # or an appropriate list of files
# compute() on a list of delayed objects returns a one-element tuple
cols, = compute([get_cols(f) for f in files])

Next, I would convert cols into a dataframe and create a schema that is satisfactory. This depends on your data and goals, so I can't help much here.
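
For instance, one simple scheme is the union of every header seen (variable names follow the snippet above):

# `cols` is the list of {'file': ..., 'cols': [...]} dicts computed above
col_df = pd.DataFrame(cols)

# a simple master schema: the union of all headers across all files
master_columns = sorted({c for file_cols in col_df['cols'] for c in file_cols})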

Since there is the issue of files whose headers do not start on the first row, I would either use try/except within get_cols or implement other logic to obtain the column names; this also depends on the data, so I can't help more here.
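
A minimal try/except variant of get_cols (not tested) might look like this, so one malformed file does not fail the whole batch and problem files can be revisited separately:

@delayed
def get_cols_safe(file):
    try:
        cols = pd.read_csv(file, nrows=1).columns.tolist()
        return {'file': file, 'cols': cols, 'error': None}
    except Exception as exc:  # e.g. pandas.errors.ParserError
        return {'file': file, 'cols': None, 'error': str(exc)}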

Upvotes: 1
