Reputation: 27
I'm training a binary classification model on a huge dataset in Parquet format. There is so much data that I cannot fit all of it into memory. Currently I'm doing the following, but I'm running into an out-of-memory problem.
import glob
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # this materializes the entire dataset in memory
X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets
What is the best way to do this without running out of memory?
Upvotes: 0
Views: 94
Reputation: 171
You don't need to load all of the data at once. Whether that works depends on whether the classification algorithm you are using supports incremental training. In scikit-learn, any estimator that implements the partial_fit API is a candidate, such as SGDClassifier. If you are using TensorFlow, you can use tfio.experimental.IODataset to stream your Parquet files to the DNN you are training. A sketch of the scikit-learn approach is below.
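As a minimal sketch (not a drop-in solution): read one Parquet file at a time with pandas and feed each one to SGDClassifier.partial_fit. The 'label' column and the data/*.parquet glob come from your snippet; the classifier settings and the assumption of 0/1 labels are mine.

import glob
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()   # any estimator with partial_fit would work here
classes = [0, 1]        # assumed binary labels; required on the first partial_fit call

for path in sorted(glob.glob('data/*.parquet')):
    chunk = pd.read_parquet(path)      # only one file is held in memory at a time
    X = chunk.drop(['label'], axis=1)
    y = chunk['label']
    clf.partial_fit(X, y, classes=classes)

Note that partial_fit needs the full set of class labels up front, which is why classes is passed explicitly on every call (it is only used on the first one).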
Upvotes: 0