btomtom5

Reputation: 860

Machine Learning Development Workflow for Large Datasets

What workflow do you use when you have a large dataset of 300 GB and your computer only has 250 GB of memory?

I would definitely use a dev set locally, but do you put the full 300 GB in an S3 bucket for production, so that it is easy to power down the AWS instance when you are not using it and easy to extract the model when the computation is done?

I did a couple of basic measurements, and it takes 5 seconds on average to load a file from S3. Does S3 perform significantly better when the files are stored in bigger chunks?
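For reference, this is roughly how I measured it: a minimal sketch assuming boto3 credentials are already configured, and the bucket and key names below are placeholders for objects of different sizes.

```python
# Time S3 downloads for objects of different sizes.
# Assumes boto3 credentials are configured; bucket/keys are hypothetical.
import time

import boto3

s3 = boto3.client("s3")

BUCKET = "my-training-data"          # hypothetical bucket name
KEYS = [
    "shards/small-10mb.parquet",     # hypothetical keys of increasing size
    "shards/medium-100mb.parquet",
    "shards/large-1gb.parquet",
]

for key in KEYS:
    start = time.perf_counter()
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    elapsed = time.perf_counter() - start
    mb = len(body) / 1e6
    print(f"{key}: {mb:.1f} MB in {elapsed:.2f} s ({mb / elapsed:.1f} MB/s)")
```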

Upvotes: 0

Views: 78

Answers (1)

wind

Reputation: 1020

It depends (as usual). :)

  1. You can try to filter your data during load (corrupted examples, outliers, etc.).
  2. If you need all the data at once, you can use distributed computing (look at http://spark.apache.org - a popular distributed computation framework) together with a machine learning library that runs on it (e.g. https://spark.apache.org/mllib/). See the sketch after this list.
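A minimal PySpark sketch of both points, filtering during load and then training with MLlib. The S3 path, column names, and thresholds are assumptions you would adapt to your own schema.

```python
# Filter bad rows while loading from S3, then train with Spark MLlib.
# The data stays distributed, so it never has to fit on one machine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("large-dataset-training").getOrCreate()

# 1. Load straight from S3 and filter during load (corrupted rows, outliers).
df = (
    spark.read.parquet("s3a://my-training-data/shards/")  # hypothetical path
    .dropna(subset=["feature_a", "feature_b", "label"])   # drop corrupted rows
    .filter(F.col("feature_a") < 1e6)                     # crude outlier cut
)

# 2. Train a model with MLlib on the distributed DataFrame.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"], outputCol="features"
)
train_df = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)

# Save the fitted model back to S3 so you can shut the cluster down afterwards.
model.write().overwrite().save("s3a://my-training-data/models/lr")  # hypothetical
```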

Upvotes: 1
