Reputation: 11208
I'm reading a big CSV file using Dask and doing some postprocessing on it (for example, doing some math, then predicting with an ML model and writing the results to a database). To avoid loading all the data into memory, I want to read it in chunks of a certain size: read the first chunk, predict, write, read the second chunk, and so on.
I tried the following solution using skiprows and nrows:
import dask.dataframe as dd
read_path = "medium.csv"
# Read one chunk: skip the first skiprows rows, then take the next nrows rows
skiprows = 100000
nrows = 50000
res_df = dd.read_csv(read_path, skiprows=skiprows)
res_df = res_df.head(nrows)
print(res_df.shape)
print(res_df.head())
But I get this error:
ValueError: Sample is not large enough to include at least one row of data. Please increase the number of bytes in sample in the call to read_csv/read_table
Also, as I understand it, this will compute a boolean mask ([False, False, ..., True, ...]) over all the data every time just to find the rows to load. How can we do this more efficiently? Maybe by using some distributed or delayed functions from Dask?
Upvotes: 5
Views: 7424
Reputation: 57251
Dask DataFrame will partition the data for you; you don't need to use nrows/skiprows.
df = dd.read_csv(filename)
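If you want to control how large each chunk is, read_csv also takes a blocksize argument (roughly how many bytes of the CSV go into each partition); the 64 MB below is just an example value, not a recommendation.
df = dd.read_csv(filename, blocksize="64MB")  # ~64 MB of CSV text per partition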
If you want to pick out a particular partition, you can use the partitions accessor:
part = df.partitions[i]
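For the chunk-at-a-time workflow from the question (read a chunk, predict, write, move on), you can loop over the partitions and compute them one by one. A minimal sketch, assuming a hypothetical predict_and_write helper that wraps your model and database code:
for i in range(df.npartitions):
    chunk = df.partitions[i].compute()  # materialise this partition as a pandas DataFrame
    predict_and_write(chunk)            # hypothetical: run the model and write results to the DB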
However, you might also want to apply your functions in parallel:
df.map_partitions(process).to_csv("data.*.csv")
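Here process receives each partition as an ordinary pandas DataFrame. A hedged sketch of what it might look like, assuming a column x as a stand-in for your real features and model:
import pandas as pd

def process(pdf: pd.DataFrame) -> pd.DataFrame:
    # Any pandas / scikit-learn style code works here, since pdf is a plain pandas DataFrame.
    pdf["score"] = pdf["x"] * 2  # placeholder for the real math / model prediction
    return pdf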
Upvotes: 4