arno_v

Reputation: 20267

Reading data that doesn't fit into memory with Dask

I have a big file (25 GB) that doesn't fit into memory. I want to do some operations on it with Dask. I have tried two approaches, but both fail with memory errors.

Approach 1

>>> import dask.dataframe as dd
>>> df = dd.read_json('myfile.jsonl', lines=True)
MemoryError:

Approach 2

>>> # split the file into 12 pieces with the unix split command,
>>> # each of which fits in memory by itself
>>> import dask.dataframe as dd
>>> df = dd.read_json('myfile_split.*', lines=True)
ValueError: Could not reserve memory block

What am I doing wrong here?

Upvotes: 0

Views: 1027

Answers (1)

MRocklin

Reputation: 57271

I recommend using the blocksize= keyword argument:

df = dd.read_json(..., lines=True, blocksize="32 MiB")
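A minimal end-to-end sketch of what that might look like (the column name below is just a placeholder for illustration):

>>> import dask.dataframe as dd
>>> # blocksize splits the line-delimited JSON into roughly 32 MiB partitions,
>>> # so Dask never has to hold the whole 25 GB file in memory at once
>>> df = dd.read_json('myfile.jsonl', lines=True, blocksize="32 MiB")
>>> df.npartitions  # roughly file size / blocksize
>>> df['some_column'].value_counts().compute()  # placeholder column; work runs partition by partition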

Upvotes: 1
