Reputation: 1607
I have about 12,000 CSV files (~50 GB total) stored in a Google Cloud Storage bucket that have nearly identical column structures. However, there are some differences. Notably:
Overview of the goal:
Pseudo-Code Thought Process
Precursor: I'm a bit lost as to how to approach this, as I'm not very savvy with pandas/numpy. But I do have a general idea of the steps I'd need to take in order to accomplish this:
- Use pandas.read_csv(..., nrows=25) to peek at each file's contents (a rough sketch of this step follows below)
- Cast every single column dtype to str (to keep memory usage down)

Note: The machine used is limited to about 16 GB of RAM, and downloading the entire folder locally is also not viable due to storage limitations.
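As a minimal, untested sketch of that peek step (assuming gcsfs is installed so pandas can read gs:// paths directly; the bucket path below is a placeholder):

import pandas as pd

# Placeholder GCS path; with gcsfs installed, pandas can read gs:// URIs directly.
sample_uri = "gs://my-bucket/path/to/one-file.csv"

# Peek at the first 25 rows, forcing every column to str to keep memory usage down.
preview = pd.read_csv(sample_uri, nrows=25, dtype=str)
print(preview.columns.tolist())
print(preview.dtypes)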
Once the files are normalized I can handle creating a BigQuery LoadJob to ingest the file data.
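For completeness, a minimal LoadJob sketch with the google-cloud-bigquery client might look like the following (the source URI and table id are placeholders, and autodetect could be swapped for an explicit schema):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # or supply an explicit schema
)

# Placeholder source URI and destination table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/normalized/one-file.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until the load job completes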
Upvotes: 1
Views: 252
Reputation: 16581
If all columns started on the first row, I would have used something like this (not tested):
import pandas as pd
from glob import glob
from dask import delayed, compute

@delayed
def get_cols(file):
    # Read only the header row to discover each file's columns.
    cols = pd.read_csv(file, nrows=1).columns
    return {'file': file, 'cols': cols}

files = glob('*csv')  # or an appropriate list of files
cols = compute([get_cols(f) for f in files])
Next, I would convert cols into a dataframe and create a schema that is satisfactory. This depends on your data and goals, so I can't help much here.
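As a rough sketch of that conversion (using the get_cols results above; note that compute() returns a tuple, so the list of dicts is its first element):

import pandas as pd

results = cols[0]  # compute() wraps its results in a tuple
summary = pd.DataFrame(
    [{'file': r['file'], 'cols': tuple(r['cols'])} for r in results]
)

# Group files by their exact column layout to see how many distinct schemas exist.
print(summary.groupby('cols').size().sort_values(ascending=False))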
Since there is the issue of files that do not start on the first row, I would either use try/except within get_cols or implement other logic to obtain the column names, which also depends on the data, so I can't help more here.
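Purely as an illustration, a variant of get_cols that scans the first few lines for a plausible header row could look like this (the header heuristic is an assumption and would need to match your actual files):

import pandas as pd
from dask import delayed

@delayed
def get_cols_robust(file, max_scan=5):
    # Try successive header positions until pandas parses a row whose
    # column names are all non-empty, named strings.
    for skip in range(max_scan):
        try:
            cols = pd.read_csv(file, nrows=1, skiprows=skip).columns
        except Exception:
            continue
        if all(isinstance(c, str) and c.strip() and not c.startswith('Unnamed') for c in cols):
            return {'file': file, 'cols': cols, 'header_row': skip}
    return {'file': file, 'cols': None, 'header_row': None}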
Upvotes: 1