Mason

Reputation: 27

What is the best way to train a binary classifier on 1000 parquet files?

I'm training a binary classification model on a huge dataset stored in parquet format. The dataset is so large that I cannot fit all of the data into memory. Currently I am doing the following, but I am running into an out-of-memory problem.

import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

# Build a lazy dask dataframe from all files, then materialise it
df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # this pulls the entire dataset into memory

X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets

Is there a better way to do this without running out of memory?

Upvotes: 0

Views: 94

Answers (1)

Kurumi Tokisaki

Reputation: 171

You don't need to load all the data at once. Whether this works depends on whether the classification algorithm you are using supports incremental training. In scikit-learn, every estimator that implements the partial_fit API is a candidate, such as SGDClassifier. If you are using TensorFlow, you can use tfio.experimental.IODataset to stream your parquet files to the DNN you are training.
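For instance, here is a minimal sketch of the scikit-learn route, assuming each parquet file fits in memory on its own and that the 'label' column holds 0/1 labels as in your snippet:

import glob

import pandas as pd
from sklearn.linear_model import SGDClassifier

files = sorted(glob.glob('data/*.parquet'))
clf = SGDClassifier()  # any estimator exposing partial_fit works the same way

for path in files:
    # Only one file is held in memory at a time
    chunk = pd.read_parquet(path)
    X = chunk.drop(['label'], axis=1)
    y = chunk['label']
    # classes is required on the first call; passing it every time is harmless
    clf.partial_fit(X, y, classes=[0, 1])

Since SGD sees each sample only once per pass, you may want to loop over the files for several epochs and shuffle their order between passes.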

Upvotes: 0
