Reputation: 20267
I have a big file (25 GB) that doesn't fit into memory. I want to do some operations on it with Dask. I have tried two approaches, but both fail with memory errors.
>>> import dask.dataframe as dd
>>> df = dd.read_json('myfile.jsonl', lines=True)
MemoryError:
>>> # split the file into 12 pieces with the Unix split command,
>>> # each of which fits in memory by itself
>>> import dask.dataframe as dd
>>> df = dd.read_json('myfile_split.*', lines=True)
ValueError: Could not reserve memory block
What am I doing wrong here?
Upvotes: 0
Views: 1027
Reputation: 57271
I recommend using the blocksize= keyword argument:
df = dd.read_json(..., lines=True, blocksize="32 MiB")
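As a minimal sketch for the 25 GB JSON-lines file from the question (the filename and the 32 MiB value are just taken from the question and answer above; tune blocksize to your memory budget):

import dask.dataframe as dd

# Read the line-delimited JSON in ~32 MiB chunks rather than one huge partition;
# partitions are loaded lazily, so the full 25 GB is never held in memory at once.
df = dd.read_json('myfile.jsonl', lines=True, blocksize="32 MiB")

# Work stays lazy until you compute; reductions like this return a small result.
print(df.npartitions)
print(len(df))

With a blocksize set, Dask splits the file into many partitions and processes them a chunk at a time, which is what avoids the MemoryError from trying to load everything in one go.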
Upvotes: 1