Reputation: 51
I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives an out-of-memory error.
I have also tried using petastorm, but make_reader() doesn't work because the data isn't a petastorm dataset:
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
with make_batch_reader('dir') as reader:
    dataset = make_petastorm_dataset(reader)
When I used make_batch_reader() and then make_petastorm_dataset(reader), it again gave a "zip not iterable" error, or something along those lines.
I am not sure how to load the file into Python for ML training. Some quick help would be greatly appreciated.
Thanks Zash
Upvotes: 5
Views: 6292
Reputation: 376
You can load the entire dataset with Dask using the code below. You can also load only chunks of the data whenever needed, by computing only those rows via the index (assuming you have a distinct index).
import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

@delayed
def load_chunk(pth):
    # Each delayed call reads a single parquet file into a pandas DataFrame.
    x = ParquetFile(pth).to_pandas()
    # Drop columns you don't need to save memory (placeholder column name).
    x = x.drop('[unwanted_columns_to_save_space]', axis=1)
    return x

files = glob.glob('./your_path/*.parquet')
ddf = dd.from_delayed([load_chunk(f) for f in files])
df = ddf.compute()
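If you only need part of the data at a time, here is a minimal sketch of that idea, reusing ddf from above; the partition number and index range are placeholders, and label-based slicing assumes the index divisions are known to Dask:

# Materialize just one file's worth of data (one delayed chunk == one partition).
part = ddf.get_partition(0).compute()

# Or, if the index divisions are known, slice by index label instead:
subset = ddf.loc[10_000:20_000].compute()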
Upvotes: 2
Reputation: 35
For pyarrow, you can list the directory with Python, iterate over the *.parquet files, open each one as a pq.ParquetFile, and read it one row group at a time. This alleviates the memory pressure, but won't be very fast without parallelization.
For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful, but you can inspect the stack trace and investigate where in the petastorm code it originates.
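A minimal sketch of that setup for TensorFlow; the file:// URL is a placeholder, and one assumption worth checking is that make_batch_reader is given a URL (e.g. file:///abs/path) rather than a bare directory path:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader('file:///abs/path/to/dir') as reader:
    dataset = make_petastorm_dataset(reader)
    for batch in dataset:
        # Each batch is a named tuple with one tensor per parquet column.
        ...

For PyTorch, petastorm also provides petastorm.pytorch.DataLoader, which wraps a reader in a similar way.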
Upvotes: 2