CaffeinatedMike

Reputation: 1607

How can I use pandas/dask to normalize columns across 12k CSV files (50 GB) in a Google Cloud Storage bucket in a memory-friendly way?

I have about 12,000 CSV files (50 GB) stored in a Google Cloud Storage bucket with nearly identical column structures, though there are some differences: some files have extra rows before the header row, and some are missing columns that other files have.

Overview of the goal:

  1. Gather an extensive list of column headers from all files
  2. Use that extensive list of column headers to normalize all files so they have matching schemas (a rough per-file sketch follows this list):
    • removing any rows that precede the header row
    • adding any missing columns to each file
    • adding the filename and google storage folder path as static columns to each file
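
To make the goal concrete, a minimal per-file sketch of step 2 might look like the following (not tested; it assumes gcsfs is installed so pandas can read gs:// paths directly, and MASTER_COLUMNS, normalize_file and header_row are hypothetical names for the full column list, the per-file routine and the detected header position):

import pandas as pd

# hypothetical master list built in step 1 (the union of every header seen)
MASTER_COLUMNS = ['col_a', 'col_b', 'col_c']

def normalize_file(gcs_path, header_row):
    # skiprows drops any junk rows that precede the real header row
    df = pd.read_csv(gcs_path, skiprows=header_row, dtype=str)
    # add any columns the file is missing (filled with NaN) and fix the order
    df = df.reindex(columns=MASTER_COLUMNS)
    # record where the data came from as static columns
    folder, _, filename = gcs_path.rpartition('/')
    df['source_folder'] = folder
    df['source_file'] = filename
    return df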

Pseudo-Code Thought Process

Precursor: I'm a bit lost as to how to approach this, as I'm not very savvy with pandas/numpy. But I do have a general idea of the steps I'd need to take to accomplish this.

  1. Utilize pandas.read_csv(..., nrows=25) to peek at each file's contents
  2. Maybe compare each row's populated cell count against the file's widest row to determine which row holds the header columns? (a sketch follows this list)
  3. Always assign str to every single column dtype (to keep memory usage down)
  4. ?
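
A rough sketch of the header-detection idea in step 2 (not tested; gcsfs is an assumption here, and the "first row with the maximum field count" heuristic may need tuning for your data):

import csv
import gcsfs

fs = gcsfs.GCSFileSystem()

def find_header_row(gcs_path, peek_rows=25):
    # peek at only the first few rows, so memory usage stays minimal
    with fs.open(gcs_path, 'rt', newline='') as f:
        rows = [row for _, row in zip(range(peek_rows), csv.reader(f))]
    # assume the header row is the first row with the maximum field count;
    # rows that precede the header usually have fewer populated cells
    widest = max(len(r) for r in rows)
    return next(i for i, r in enumerate(rows) if len(r) == widest)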

Note: The machine used is limited to about 16 GB of RAM, and downloading the entire folder locally is not viable due to storage limitations.

Once the files are normalized I can handle creating a BigQuery LoadJob to ingest the file data.
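
For reference, that load step could look roughly like this (a sketch using the google-cloud-bigquery client; the bucket prefix, project, dataset and table names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # the normalized files all have a single header row
    autodetect=True,       # or pass an explicit schema built from the master column list
    write_disposition='WRITE_APPEND',
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/normalized/*.csv',
    'my-project.my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # wait for the job to finish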

Upvotes: 1

Views: 252

Answers (1)

SultanOrazbayev

Reputation: 16581

If all files' headers started on the first row, I would have used something like this (not tested):

import pandas as pd
from glob import glob
from dask import delayed, compute

@delayed
def get_cols(file):
    # read only the header row, so very little data is pulled per file
    cols = pd.read_csv(file, nrows=1).columns.tolist()
    return {'file': file, 'cols': cols}

files = glob('*.csv')  # or an appropriate list of files
# compute() on a list of delayed objects returns a one-element tuple
cols, = compute([get_cols(f) for f in files])

Next, I would convert cols into a dataframe and create a schema that is satisfactory. This depends on your data and goals, so I can't help much here.
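
For instance, one simple scheme is the union of every header seen (variable names follow the snippet above):

# `cols` is the list of {'file': ..., 'cols': [...]} dicts computed above
col_df = pd.DataFrame(cols)

# a simple master schema: the union of all headers across all files
master_columns = sorted({c for file_cols in col_df['cols'] for c in file_cols})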

Since there is the issue of files whose headers do not start on the first row, I would either use try/except within get_cols or implement other logic to obtain the column names; this also depends on the data, so I can't help more here.
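
A minimal try/except variant of get_cols (not tested) might look like this, so one malformed file does not fail the whole batch and problem files can be revisited separately:

@delayed
def get_cols_safe(file):
    try:
        cols = pd.read_csv(file, nrows=1).columns.tolist()
        return {'file': file, 'cols': cols, 'error': None}
    except Exception as exc:  # e.g. pandas.errors.ParserError
        return {'file': file, 'cols': None, 'error': str(exc)}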

Upvotes: 1
