Zash

Reputation: 51

How to load large multi file parquet files for tensorflow/pytorch

I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.

The files are too large to be loaded through the pyarrow.parquet functions:

import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()

This gives an out-of-memory error.

I have also tried petastorm, but make_reader() doesn't work because the dataset isn't in the petastorm format.

with make_batch_reader('dir') as reader:
  dataset = make_petastorm_dataset(reader)

When I used make_batch_reader() and then make_petastorm_dataset(reader), it again gave an error, something along the lines of "zip not iterable".

I am not sure how to load the file into Python for ML training. Some quick help would be greatly appreciated.

Thanks, Zash

Upvotes: 5

Views: 6292

Answers (2)

Nishank Lakkakula

Reputation: 376

You can load the entire dataset with Dask using the code below. You can also load only chunks of the data when needed, by computing just those partitions via the index (assuming your files have a meaningful index); a sketch of that follows the code.

import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

@delayed
def load_chunk(path):
    # Read one parquet file into a pandas DataFrame.
    x = ParquetFile(path).to_pandas()
    # Optionally drop columns you don't need, to save memory, e.g.:
    # x = x.drop(['unwanted_column'], axis=1)
    return x

files = glob.glob('./your_path/*.parquet')

# Build a lazy Dask dataframe from the delayed per-file loads.
ddf = dd.from_delayed([load_chunk(f) for f in files])

# Materialize everything into a single pandas DataFrame (needs enough RAM).
df = ddf.compute()
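If the full dataframe does not fit in memory, you can compute only part of it instead of calling ddf.compute() on everything. A minimal sketch (the 'timestamp' column name and the date range are placeholders, and the label-based slice assumes the files are already globally sorted by that column):

# Option A: compute one underlying partition (roughly one file) at a time.
part = ddf.get_partition(0).compute()

# Option B: set a sorted index and slice by label, so only the
# partitions covering the requested range are read and computed.
ddf = ddf.set_index('timestamp', sorted=True)
subset = ddf.loc['2020-01-01':'2020-01-31'].compute()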

Upvotes: 2

megaserg

Reputation: 35

For pyarrow, you can list the directory with Python, iterate over *.parquet files, open each one as pq.ParquetFile, and read it one row group at a time. This will alleviate the memory pressure, but won't be super fast without parallelization.
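A minimal sketch of that approach, assuming 'dir' is the directory from the question and that a single row group fits in memory:

import glob
import pyarrow.parquet as pq

for path in glob.glob('dir/*.parquet'):
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        # Read one row group at a time; only this chunk is held in memory.
        batch = pf.read_row_group(i).to_pandas()
        # ... feed `batch` to your training loop ...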

For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful, but you can inspect the stack trace and investigate where in the petastorm code it originates.
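One hedged sketch of that route (not tested against your data): petastorm generally expects a URL such as 'file:///absolute/path/to/dir' rather than a bare directory path, which may be the cause of the error you saw. Assuming a TensorFlow 2 eager pipeline, and with the path below as a placeholder:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Note the file:// URL scheme.
with make_batch_reader('file:///absolute/path/to/dir') as reader:
    dataset = make_petastorm_dataset(reader)
    for batch in dataset:
        # Each batch is a named tuple with one tensor per parquet column.
        pass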

Upvotes: 2
