Mikhail_Sam

Reputation: 11208

Efficiently read big csv file by parts using Dask

I'm reading a big CSV file with Dask and doing some postprocessing on it (for example, some math, then prediction with an ML model, then writing the results to a database). To avoid loading all the data into memory, I want to read it in chunks of a fixed size: read the first chunk, predict, write, read the second chunk, and so on.

I tried the following solution using skiprows and nrows:

import dask.dataframe as dd
read_path = "medium.csv"

# Read by chunk
skiprows = 100000
nrows = 50000
res_df = dd.read_csv(read_path, skiprows=skiprows)
res_df = res_df.head(nrows)

print(res_df.shape)
print(res_df.head())

But I get an error:

ValueError: Sample is not large enough to include at least one row of data. Please increase the number of bytes in sample in the call to read_csv/read_table

Also, as I understand it, this will compute a boolean mask ([False, False, ... , True, ...]) over all the data every time just to find the rows to load. How can we do this more efficiently? Maybe use some distributed or delayed functions from Dask?

Upvotes: 5

Views: 7424

Answers (1)

MRocklin

Reputation: 57251

Dask dataframe will partition the data for you; you don't need to use nrows/skiprows:

df = dd.read_csv(filename)
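For reference, here is a minimal sketch (not from the original answer) showing how the partition size can be controlled with the blocksize argument of read_csv; "64MB" is only an illustrative value:

import dask.dataframe as dd

# blocksize controls roughly how many bytes go into each partition;
# "64MB" is an illustrative value, tune it to your memory budget
df = dd.read_csv("medium.csv", blocksize="64MB")
print(df.npartitions)  # number of chunks Dask will process lazily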

If you want to pick out a particular partition then you could use the partitions accessor:

part = df.partitions[i]
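As a sketch of the chunk-at-a-time workflow described in the question, built on this accessor (predict_and_write is a hypothetical placeholder for your prediction and database-writing code):

import dask.dataframe as dd

df = dd.read_csv("medium.csv")

for i in range(df.npartitions):
    # materialize one partition at a time as a pandas DataFrame
    chunk = df.partitions[i].compute()
    # predict_and_write is a hypothetical stand-in for the ML prediction
    # and database write from the question
    predict_and_write(chunk)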

However, you might also want to apply your functions in parallel.

df.map_partitions(process).to_csv("data.*.csv")
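A sketch of what process could look like; the body here (summing the numeric columns into a new column) is only an illustrative placeholder, not the asker's actual model:

import dask.dataframe as dd

df = dd.read_csv("medium.csv")

def process(pdf):
    # pdf is a plain pandas DataFrame holding one partition;
    # replace this body with the real math / ML-prediction step
    pdf["row_sum"] = pdf.select_dtypes("number").sum(axis=1)
    return pdf

# apply process to every partition in parallel and write one
# output file per partition (data.0.csv, data.1.csv, ...)
df.map_partitions(process).to_csv("data.*.csv")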

Upvotes: 4
