SRB

Reputation: 41

Merge csv files using dask

I am new to Python. I am using dask to read 5 large (>1 GB) csv files and merge them (SQL-like join) into a dask dataframe. Now I am trying to write the merged result into a single csv. I used compute() on the dask dataframe to collect the data into a single pandas df and then called to_csv. However, compute() is slow at reading the data across all partitions. I also tried calling to_csv directly on the dask df, and it created multiple .part files (I didn't try merging those .part files into one csv).

Is there any alternative way to get the dask df into a single csv, or any parameter to compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.

Thanks

Upvotes: 4

Views: 5394

Answers (2)

vtnate

Reputation: 163

As of Dask 2.4.0, you can specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
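For context, a minimal sketch of how that might look in the asker's situation (the file names and join column below are made up for illustration):

import dask.dataframe as dd

# hypothetical inputs and join key -- adjust to your own files and columns
left = dd.read_csv('orders_*.csv')
right = dd.read_csv('customers.csv')
merged = left.merge(right, on='customer_id', how='inner')

# single_file=True (Dask >= 2.4.0) writes one CSV instead of one .part file per partition
merged.to_csv('path/to/csv.csv', single_file=True, index=False)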

Like @mrocklin said, I recommend using other file formats.

Upvotes: 2

MRocklin

Reputation: 57281

Dask.dataframe will not write to a single CSV file. As you mention, it writes multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
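In code, that in-memory route is roughly the following (assuming merged is your joined dask dataframe); it only works if the merged result fits comfortably in RAM, which is a real concern with 6 GB:

# collapses the whole dask.dataframe into one in-memory Pandas dataframe before writing
merged.compute().to_csv('merged.csv', index=False)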

One option is to avoid Pandas and Dask altogether and just read bytes from the multiple files and dump them into another file:

with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to burn a line here with `next(infile)`
            for line in infile:
                # lines read from the file keep their trailing newline, so write them as-is
                outfile.write(line)

If you don't need to do anything except merge your CSV files into a larger one, then I would just do this and not touch pandas/dask at all. They would try to parse the CSV data into in-memory dataframes, which takes a while and which you don't need. If, on the other hand, you need to do some processing with pandas/dask, then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards.
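A rough sketch of that two-step workflow, with a made-up filter standing in for whatever processing you actually need:

import dask.dataframe as dd

df = dd.read_csv('input_*.csv')
processed = df[df.amount > 0]        # placeholder for your real merge/cleaning logic

# step 1: let dask write one CSV per partition; to_csv returns the list of files written
parts = processed.to_csv('out-*.csv', index=False)

# step 2: stitch the parts together with plain file I/O, as above
with open('combined.csv', 'w') as outfile:
    for i, part in enumerate(parts):
        with open(part, 'r') as infile:
            if i > 0:
                next(infile)         # skip the repeated header after the first part
            for line in infile:
                outfile.write(line)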

You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
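For example, a brief sketch of the Parquet route (it needs pyarrow or fastparquet installed; the paths are illustrative):

import dask.dataframe as dd

df = dd.read_csv('input_*.csv')

# Parquet keeps the data partitioned on disk and is usually much faster to read back than CSV
df.to_parquet('merged_output.parquet')

# later, read it straight back into a dask dataframe
df2 = dd.read_parquet('merged_output.parquet')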

Upvotes: 4
