Reputation: 435
I am trying to find a way to start working with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect later on, and with loading chunks at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas and also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask, but I have not used it yet.
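For reference, the chunked loading I mean with plain Pandas looks roughly like this (file name and chunk size are just placeholders):

```python
import pandas as pd

# Read the CSV lazily in fixed-size pieces instead of one huge DataFrame;
# each chunk is an ordinary DataFrame that fits in memory on its own.
for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
    # ... preprocess / feed the chunk to a model here ...
    print(chunk.shape)
```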
Upvotes: 3
Views: 5108
Reputation: 57281
This blogpost goes through an example using XGBoost on a large CSV dataset. However, it did so with a distributed cluster that had enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all of the data is available all the time.
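For what it's worth, a minimal sketch of the Dask route looks like the following. It is untested here; the file path, label column name, and parameters are placeholders, and it assumes a dask.distributed cluster with enough total memory for the data, plus XGBoost's Dask interface.

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client()  # connect to a local or remote Dask cluster

# Lazily read the CSV in partitions instead of one big in-memory frame
df = dd.read_csv("big_data.csv")
X = df.drop(columns=["target"])  # "target" is a placeholder label column
y = df["target"]

# Distributed training via XGBoost's Dask interface
dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = result["booster"]  # trained xgboost.Booster; use xgb.dask.predict for distributed prediction
```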
Upvotes: 5
Reputation: 3056
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a NumPy array, so you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays for input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
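A rough sketch of what I mean (again, untested; the file name, dataset names, and the random stand-in data are placeholders):

```python
import h5py
import numpy as np
from xgboost import XGBRegressor

# One-time conversion: write features and labels into an HDF5 file on disk
with h5py.File("train.h5", "w") as f:
    f.create_dataset("X", data=np.random.rand(1000, 20))  # stand-in for the real features
    f.create_dataset("y", data=np.random.rand(1000))      # stand-in for the real labels

# Training: h5py datasets expose a numpy-like interface, so the sklearn API
# should accept them; a small subsample keeps per-round memory usage down
f = h5py.File("train.h5", "r")
model = XGBRegressor(subsample=0.1, n_estimators=100)
model.fit(f["X"], f["y"])
f.close()
```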
Upvotes: 0