Gwuieon

Reputation: 23

What is the best way to store preprocessed data in a machine learning pipeline?

In my case, the raw data is stored in a NoSQL database. Before training the ML model, I have to preprocess the raw data. Once it is preprocessed, what is the best way to keep the preprocessed data?

1. Keep it in memory
2. Keep it in another table in the NoSQL database
3. Can you recommend other options?

Upvotes: 1

Views: 1359

Answers (2)

dijksterhuis

Reputation: 1271

It depends on your use case, the size of the data, your tech stack, and your machine learning framework/library. Truth be told, without knowledge of your data and requirements, no-one on SO will be able to give you a complete answer.

In terms of passing data to the model and running it, load the data into memory. Look at batching your data into the model if you hit memory limits (a minimal sketch follows below). Or use an AWS EMR cluster!
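As a rough illustration of batching, here is a sketch assuming the preprocessed data sits in a hypothetical CSV file with a `label` column; pandas reads it in chunks so the whole dataset never has to fit in memory at once:

```python
import pandas as pd

def batches(path, chunk_rows=100_000):
    """Yield (features, labels) one chunk at a time."""
    for chunk in pd.read_csv(path, chunksize=chunk_rows):
        yield chunk.drop(columns=["label"]), chunk["label"]

# Hypothetical usage with any model that supports incremental training,
# e.g. a scikit-learn estimator with partial_fit:
# for X_batch, y_batch in batches("preprocessed.csv"):
#     model.partial_fit(X_batch, y_batch, classes=[0, 1])
```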

For the question on storing the data, I’ll use the previous answer’s example of Spark and try to give some general rules.

  1. If the processed data is “Big” and regularly accessed (e.g. once a month/week/day), then store it in a distributed manner, then load it into memory when running the model.

For Spark, your best bet is to write it out as partitioned Parquet files or to a Hive data warehouse.

The key thing about those two is that they are distributed. Spark will create N Parquet files containing all your data. When it comes to reading the dataset into memory (before running your model), Spark can read from many files at once, saving a lot of time. TensorFlow does a similar thing with the TFRecord format. (A minimal sketch follows after this item.)

If your NoSQL database is distributed, then you can potentially use that.
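Here is a minimal PySpark sketch of the Parquet approach above. The paths, column names, and the `date` partition key are all assumptions; the point is that the write is split across many files so later reads can happen in parallel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preprocess").getOrCreate()

# Hypothetical: raw data already exported from the NoSQL store as JSON.
raw = spark.read.json("s3://my-bucket/raw/")  # path is an assumption

# Stand-in for your real preprocessing steps.
processed = raw.dropna()

# Write N partitioned Parquet files; partitioning by "date" is an assumption.
processed.write.mode("overwrite").partitionBy("date").parquet(
    "s3://my-bucket/processed/"
)

# Later, before training: Spark reads the partitions in parallel.
df = spark.read.parquet("s3://my-bucket/processed/")
```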

  2. If it won’t be regularly accessed and is “small”, then just run the preprocessing code from scratch and load the result into memory (see the sketch after this item).

If the processing takes no time at all and it’s not used for other work, then there’s no point storing it. It’s a waste of time. Don’t even think about it. Just focus on your model, get the data in memory and get running.
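As a sketch of that option, assume the raw data lives in MongoDB (one possible NoSQL store) in a hypothetical `events` collection; everything is re-processed on the fly and held only in memory:

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # connection string assumed
docs = client["mydb"]["events"].find({}, {"_id": 0})   # db/collection names assumed

df = pd.DataFrame(list(docs))
df = df.dropna()                                       # stand-in for your real preprocessing
X, y = df.drop(columns=["label"]), df["label"]         # `label` column assumed
# ...train the model directly on the in-memory X, y...
```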

  3. If the data won’t be regularly accessed but is “Big”, then it’s time to think hard!

You need to think carefully about the trade-off between processing time and data storage capability.

How much will it cost to store this data? How often is it needed? Is it business critical? When someone asks for this, is it always a “needed to be done yesterday” request? Etc.

---

Upvotes: 1

Kristóf Varga

Reputation: 168

The Spark framework is a good solution for what you want to do. Learn more about it here: spark. Spark for machine learning: here.

Upvotes: 0
