Tom Kim
Tom Kim

Reputation: 190

Data cleasning, migration in big, in-service database

Hi I'm a server developer and we have a big mysql database(biggest table has about 0.5 billion rows) running 24-7.

And there's a lot of broken data. Most of them are logically wrong and involves multi-source(multiple tables, s3). And since it's kinda logically complicated, we need Rails model to clean them (can't be done with pure sql queries)

Right now, I am using my own small cleansing framework and using AWS Auto Scaling Group to scale up instances and speed up. But since the database is in-service, I have to be careful(table locks and other stuffs) and limit the process amount.

So I am curious about

  1. How do you (or big companies) clean your data while the database is in-service?
    • Do you use temporary tables and swap? or just update/insert/delete to an in-service database?
    • Do you use a framework or library or solution to clean data in efficient way? (such as distributed processing)
  2. How do you detect messed up data real-time?
    • Do you use a framework or library or solution to detect broken data?

Upvotes: 0

Views: 53

Answers (1)

C dot StrifeVII
C dot StrifeVII

Reputation: 1885

So I have face a problem similar in nature to what you are dealing with but different in scale. this is how I would approach the situation.

  1. First access the infrastructure concerns, like can the data base be offline or restricted from use for a few hours for maintenance, if so read on.
  2. Next you need to define what constitutes as "broken data".
  3. Once you arrive at your definition of "broken data" translate a way to programmatically identify it.
  4. Write a script that leverages your programatic identification algorithm and run some test.
  5. Then back up you data records in preparation.
  6. Then given the scale of your data set you will probably need to increase your server resources as not to bottle neck the script.
  7. Run the script
  8. Test your data to assess the effectiveness of your script
  9. If needed adjust and rerun

Its possible to do this with out closing of the database for maintenance but I think you will get better results if you do. Also since this is a rails app I would look at the model validations that your app has and input field validations to prevent "broken data" in real time.

Upvotes: 1

Related Questions