Reputation: 27
I'm training a binary classification model on a huge dataset in Parquet format. There is so much data that I cannot fit all of it into memory. Currently I'm doing the following, but I'm running into an out-of-memory problem.
import glob
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # this materializes the entire dataset in memory
X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets
What is the best way to do this without running out of memory?
Upvotes: 0
Views: 94
Reputation: 171
You don't need to load all of the data at once. Whether that works depends on whether the classification algorithm you are using supports incremental training. In scikit-learn, any estimator that implements the partial_fit API is a candidate, such as SGDClassifier. If you are using TensorFlow, you can use tfio.experimental.IODataset to stream your Parquet files to the DNN you are training. A sketch of the scikit-learn approach is below.
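As a minimal sketch (not a drop-in solution): read one Parquet file at a time with pandas and feed each one to SGDClassifier.partial_fit. The 'label' column and the data/*.parquet glob come from your snippet; the classifier settings and the assumption of 0/1 labels are mine.

import glob
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()   # any estimator with partial_fit would work here
classes = [0, 1]        # assumed binary labels; required on the first partial_fit call

for path in sorted(glob.glob('data/*.parquet')):
    chunk = pd.read_parquet(path)      # only one file is held in memory at a time
    X = chunk.drop(['label'], axis=1)
    y = chunk['label']
    clf.partial_fit(X, y, classes=classes)

Note that partial_fit needs the full set of class labels up front, which is why classes is passed explicitly on every call (it is only used on the first one).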
Upvotes: 0