Reputation: 63
I have 7 CSV files of 8 GB each that I need to convert to Parquet.
Memory usage climbs to 100 GB and I had to kill the process. I also tried Dask Distributed with memory limited to 12 GB, but it produced no output for a long time. FYI, with traditional pandas using chunking plus a producer/consumer pattern I was able to convert the files in 30 minutes. What am I missing in my Dask processing?
import dask.dataframe as dd

def ProcessChunk(df, ...):
    df.to_parquet()

for factfile in fArrFileList:
    df = dd.read_csv(factfile, blocksize="100MB",
                     dtype=fColTypes, header=None, sep='|', names=fCSVCols)
    result = ProcessChunk(df, output_parquet_file, chunksize, fPQ_Schema, fCSVCols, fColTypes)
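(For comparison, the chunked-pandas baseline mentioned above, minus the producer/consumer part, might look roughly like this; it is only a sketch, and the function name, separator, and chunk size are assumptions, not the asker's actual code.)

import pandas as pd

# Sketch of a chunked pandas CSV-to-Parquet conversion: only one chunk is
# held in memory at a time, and each chunk becomes its own Parquet file.
def csv_to_parquet_chunked(csv_path, parquet_dir, cols, dtypes, chunksize=1_000_000):
    for i, chunk in enumerate(pd.read_csv(csv_path, sep='|', header=None,
                                          names=cols, dtype=dtypes,
                                          chunksize=chunksize)):
        chunk.to_parquet(f"{parquet_dir}/part_{i:05d}.parquet", index=False)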
Upvotes: 3
Views: 1705
Reputation: 13447
I had a similar problem and found that using Dask to split the files into many small Parquet files is very slow and eventually fails. If you have access to a Linux terminal, you can use parallel or split instead. For an example of their usage, check the answers from here.
My workflow supposes your files are called file1.csv, ..., file7.csv and are stored in data/raw. I'm assuming you are running the terminal commands from your notebook, which is why I'm adding the %%bash magic.

First, create the output folders data/raw_parts/part1/, ..., data/raw_parts/part7/:
%%bash
# create one folder per input file for the split CSV chunks
for i in {1..7}
do
    mkdir -p data/raw_parts/part${i}
done
%%bash
# split file1.csv into chunks of 1,000,000 rows each (repeat for the other files)
cat data/raw/file1.csv | parallel --header : --pipe -N1000000 'cat > data/raw_parts/part1/file_{#}.csv'
%%bash
# create matching output folders for the converted Parquet files
for i in {1..7}
do
    mkdir -p data/processed/part${i}
done
import pandas as pd
import os
from dask import delayed, compute

# this can run in parallel
@delayed
def convert2parquet(fn, fldr_in, fldr_out):
    # e.g. data/raw_parts/part1/file_1.csv -> data/processed/part1/file_1.parquet
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".csv", ".parquet")
    df = pd.read_csv(fn)
    df.to_parquet(fn_out, index=False)

# collect every split CSV under data/raw_parts/
jobs = []
fldr_in = "data/raw_parts/"
fldr_out = "data/processed/"  # matches the folders created above
for (dirpath, dirnames, filenames) in os.walk(fldr_in):
    if len(filenames) > 0:
        jobs += [os.path.join(dirpath, fn)
                 for fn in filenames]
%%time
# build the delayed tasks and run them all in parallel
to_process = [convert2parquet(job, fldr_in, fldr_out) for job in jobs]
out = compute(to_process)
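If memory is still the bottleneck, you can cap how many conversions run at once. A minimal sketch, assuming the multiprocessing scheduler (the scheduler choice and worker count are my assumptions, not part of the workflow above):

from dask import compute

# Run at most 4 conversions at a time so only a few chunks are in memory
# simultaneously; adjust num_workers to fit your RAM budget.
out = compute(to_process, scheduler="processes", num_workers=4)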
Upvotes: 1
Reputation: 63
Thanks, all, for the suggestions. map_partitions worked:
df = dd.read_csv(filename, blocksize="500MB",
                 dtype=fColTypes, header=None, sep='|', names=fCSVCols)
df.map_partitions(DoWork, output_parquet_file, chunksize, Schema, CSVCols, fColTypes).compute(num_workers=2)
But the same approach didn't work well with a Dask Distributed local cluster; in local-cluster mode it only worked when the CSV size was under 100 MB.
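For readers following along: DoWork isn't shown above. A minimal sketch of what such a partition worker could look like, assuming it writes each pandas partition into a Parquet dataset (the body and parameter usage are assumptions, not the actual code):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical partition worker matching the map_partitions call above: each
# Dask partition arrives as a pandas DataFrame and is appended to a Parquet
# dataset, so partitions never have to be concatenated in memory. Unused
# parameters are kept only to mirror the call signature.
def DoWork(df, output_parquet_file, chunksize, Schema, CSVCols, fColTypes):
    table = pa.Table.from_pandas(df, schema=Schema, preserve_index=False)
    pq.write_to_dataset(table, root_path=output_parquet_file)
    return df  # map_partitions expects a DataFrame-like return value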
Upvotes: 2