Winston

Reputation: 326

Saving a csv file as many csv files based on column values using Dask

I have a large csv file, and suppose that it looks like this

ID,PostCode,Value
H1A0A1-00,H1A0A1,0
H1A0A1-01,H1A0A1,0
H1A0A1-02,H1A0A1,0
H1A0A1-03,H1A0A1,0
H1A0A1-04,H1A0A1,1
H1A0A1-05,H1A0A1,0
H1A1G7-0,H1A1G7,0
H1A1G7-1,H1A1G7,0
H1A1G7-2,H1A1G7,0
H1A1N6-00,H1A1N6,0
H1A1N6-01,H1A1N6,0
H1A1N6-02,H1A1N6,0
H1A1N6-03,H1A1N6,0
H1A1N6-04,H1A1N6,0
H1A1N6-05,H1A1N6,0
...

I want to split it up by PostCode values and save all rows with the same postal code as a CSV. I have tried

postals = data['PostCode'].unique()
for p in postals:
    df = data[data['PostCode'] == p]
    df.to_csv(directory + '/output/demographics/' + p + '.csv', header=False, index=False)

Is there a way to do this using Dask to leverage multiprocessing? Thanks

Upvotes: 1

Views: 1042

Answers (1)

rpanai

Reputation: 13447

In case you want to save to Parquet, it's pretty easy.

Parquet

import dask.dataframe as dd

# `data` is the dask dataframe loaded from the large csv
fldr = 'data_pq'
data.to_parquet(fldr, partition_on="PostCode")

This saves the data for every postcode inside a folder called PostCode=xxxxxxx, which contains as many files as the number of partitions of your dask.dataframe.
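One nice side effect of this layout is that you can read a single postcode back without scanning the rest. A minimal sketch, assuming the data_pq folder from above (filters is a standard dask.dataframe.read_parquet option; "H1A0A1" is just one of the postcodes from the question's sample):

import dask.dataframe as dd

# load only the rows whose partition folder matches the filter;
# the other PostCode=... folders are never read
one_pc = dd.read_parquet("data_pq",
                         filters=[("PostCode", "==", "H1A0A1")])
print(one_pc.compute())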

CSV

Here I suggest you use a custom function, write_file.

import dask.dataframe as dd
import os

fldr = "data_csv"
os.makedirs(fldr, exist_ok=True)


def write_file(grp):
    # each group contains all rows for a single postcode
    pc = grp["PostCode"].unique()[0]
    grp.to_csv(f"{fldr}/{pc}.csv",
               header=False,
               index=False)
    return None


# meta describes the (dummy) output so dask can build the task graph
data.groupby("PostCode")\
    .apply(write_file, meta=('x', 'f8'))\
    .compute()

# the same function works for a pandas df too
# data.groupby("PostCode").apply(write_file)

You should check how this performs and possibly experiment with the scheduler.
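For example, a minimal sketch of switching the scheduler at compute time (scheduler="processes" is a standard Dask option; whether it beats the default threaded scheduler here depends on your data and disk):

# same groupby-apply as above, but computed with the
# multiprocessing scheduler instead of the default threads
data.groupby("PostCode")\
    .apply(write_file, meta=('x', 'f8'))\
    .compute(scheduler="processes")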

Upvotes: 5
