Ksakser

Reputation: 37

MongoDB for large volumes of daily data (or stick with MySQL?)

In short, we run a GPS vehicle-tracking business that currently works on MySQL with a large volume of data. We are studying the possibility of moving to MongoDB, but it is still not clear to us. At the moment we are testing with MongoHQ (MongoDB DBaaS).

We have approximately 2000 GPS devices sending information every minute with position, speed and state. In one day (24 hours = 1440 minutes), each device sends 1440 tracks, so 1440 tracks/device * 2000 devices = 2.8M tracks/day.

Our first idea was to have a single collection and store each track as a document in it, but generating 2.8M tracks daily means roughly 80M documents in the collection by the end of the month. We have to create daily, weekly or date-range reports; for example, if after 2 months of service a client wants a 3-day report, we would have to find 1440 tracks/day * 3 days among about 160M documents... What would the response times be like? Would MongoDB get saturated? What would happen if several clients made similar requests at once?

NOTE: Each document occupies approximately 0.3 KB, so each GPS device occupies about 0.3 KB * 1440 ≈ 0.5 MB per day (somewhat more in storageSize)...
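
For reference, a minimal pymongo sketch of what idea 1 could look like: one document per track. The connection string, database/collection names (fleet.tracks) and field names (device_id, ts, pos, speed, state) are our own placeholders, not a fixed schema; the compound index on (device_id, ts) is what a per-device date-range report would rely on.

    from datetime import datetime
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")  # or the MongoHQ connection string
    tracks = client.fleet.tracks  # hypothetical database/collection names

    # Compound index so per-device date-range reports don't scan the whole collection.
    tracks.create_index([("device_id", ASCENDING), ("ts", ASCENDING)])

    # One document per track (~0.3 KB), as in idea 1.
    tracks.insert_one({
        "device_id": 1042,
        "ts": datetime(2014, 3, 10, 14, 27),
        "pos": {"lat": 41.38, "lng": 2.17},
        "speed": 63,
        "state": "moving",
    })

    # A 3-day report for one device touches 1440 * 3 = 4320 documents via the index,
    # rather than scanning the full collection.
    report = tracks.find({
        "device_id": 1042,
        "ts": {"$gte": datetime(2014, 3, 8), "$lt": datetime(2014, 3, 11)},
    }).sort("ts", ASCENDING)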

Second idea: embedding.

Here we decided to group all the tracks into a daily document. Each GPS device gets one document per day, and its 1440 tracks are added with $push into an array of tracks. This way, each day we would have only 2k documents, and by the end of the month just 60k instead of 80M! We thought we had found gold, until we realized that $pushing 1440 tracks into each document daily causes reallocation, which makes each update take much longer and is not viable... How could we improve the embedding? And if the first idea generated about 1 GB of storage daily, with this one it is about 3 GB...
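
To illustrate, a minimal sketch of the embedded model we tried (fleet.daily_tracks and the field names are our own placeholders). Each minute's $push grows the document, which is what triggers the repeated reallocations:

    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    daily = client.fleet.daily_tracks  # hypothetical: one document per device per day

    # Upsert the (device, day) document and append the current minute's track.
    # Each $push makes the document grow, so the server may have to move
    # (reallocate) it repeatedly as the array approaches 1440 entries --
    # the slowdown described above.
    daily.update_one(
        {"device_id": 1042, "day": datetime(2014, 3, 10)},
        {"$push": {"tracks": {
            "minute": 867,
            "pos": {"lat": 41.38, "lng": 2.17},
            "speed": 63,
            "state": "moving",
        }}},
        upsert=True,
    )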

With the first idea (track = doc), each GPS device needs 0.5 MB per day (a bit more in storageSize), which is about 1 GB daily for 2000 devices, i.e. about 30 GB per month. Even with the largest MongoHQ plan (600 GB) or MongoLab plan (400 GB), we would have a maximum of 20 months of service before hitting the limits... Yet the same data in MySQL, after a year and a half, occupies no more than 30 GB... :/

For now we don't see a viable way to change and we have to stick with MySQL... Any idea on how to make a good switch from MySQL to NoSQL?

Upvotes: 1

Views: 1059

Answers (2)

RickyA

Reputation: 16029

Welcome to big data...

What we do is this: we have an influx of log events at a rate of about 200 logs/sec. These logs are put in a database.collection called log.foo. You don't touch these records; only new inserts are made here. NEVER EVER UPDATE THEM. It will lock your database and kill its performance.

What you do is create a new database.collection called aggregate.foo. This is a new database because it will have its own write lock and will therefore not interfere with your log database.

Then you create a job that you run with cron or something similar. This job queries log.foo for a given timeslice (ObjectId is very useful for that). The job aggregates these records as you see fit and puts new documents in aggregate.foo. Then you can choose to delete the rows from log.foo if you want, but storage is cheap, so why not keep them.

So essentially: combine your two ideas, but separate log insertion and aggregation.
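
A rough pymongo sketch of how such a cron job could look, assuming the log.foo / aggregate.foo layout above; the field names (device_id, ts, pos, speed) and the exact grouping are assumptions, aggregate however you see fit:

    from datetime import datetime, timedelta
    from bson.objectid import ObjectId
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    log_foo = client["log"]["foo"]              # insert-only raw logs
    aggregate_foo = client["aggregate"]["foo"]  # separate database, separate write lock

    def aggregate_yesterday(now=None):
        """Nightly cron job: roll up one day of raw logs into aggregate.foo."""
        now = now or datetime.utcnow()
        start = datetime(now.year, now.month, now.day) - timedelta(days=1)
        end = start + timedelta(days=1)

        # ObjectIds embed their creation time, so they double as a time index
        # for selecting the timeslice without any extra field.
        first_id = ObjectId.from_datetime(start)
        last_id = ObjectId.from_datetime(end)

        pipeline = [
            {"$match": {"_id": {"$gte": first_id, "$lt": last_id}}},
            {"$group": {
                "_id": "$device_id",
                "tracks": {"$push": {"ts": "$ts", "pos": "$pos", "speed": "$speed"}},
                "max_speed": {"$max": "$speed"},
            }},
        ]
        for doc in log_foo.aggregate(pipeline):
            doc["device_id"] = doc.pop("_id")
            doc["day"] = start
            aggregate_foo.insert_one(doc)

Run it from cron once a night (or more often with smaller timeslices); log.foo only ever receives inserts, and all the rewriting happens in the separate aggregate database.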

Upvotes: 1

Philipp

Reputation: 69663

When you know that your documents will reach a certain size, you can avoid reallocation by prepopulating them at creation.

When you know that your array will eventually have exactly 1440 entries, you can create the document with 1440 dummy entries that have the same set of fields, all filled with placeholder data of the same length as the real data. When you then gradually add the real data, you replace these entries with $set instead of using $push.
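
A minimal pymongo sketch of that prepopulation idea (collection and field names are placeholders); the dummy entries should have the same BSON size as the real ones so the document never needs to grow:

    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    daily = client.fleet.daily_tracks  # hypothetical collection/field names

    # Placeholder sized like a real entry, so later $set writes never grow the document.
    PLACEHOLDER = {"pos": {"lat": 0.0, "lng": 0.0}, "speed": 0, "state": "unknown"}

    def create_day_document(device_id, day):
        """Pre-populate the daily document with 1440 dummy entries at creation."""
        daily.insert_one({
            "device_id": device_id,
            "day": day,
            "tracks": [dict(PLACEHOLDER) for _ in range(1440)],
        })

    def record_track(device_id, day, minute, track):
        """Overwrite the slot for this minute in place with $set (no $push)."""
        daily.update_one(
            {"device_id": device_id, "day": day},
            {"$set": {"tracks.%d" % minute: track}},
        )

    create_day_document(1042, datetime(2014, 3, 10))
    record_track(1042, datetime(2014, 3, 10), 867,
                 {"pos": {"lat": 41.38, "lng": 2.17}, "speed": 63, "state": "moving"})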

To improve the performance of reports that aggregate data from past days, you could run a MapReduce job every night to aggregate the relevant data of the day into a new collection.
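
As one possible variant, the same nightly rollup can also be written as an aggregation pipeline with $out instead of a MapReduce job; the collection and field names below are assumptions:

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    tracks = client.fleet.tracks  # hypothetical per-track collection

    def nightly_rollup(day):
        """Aggregate one day's tracks per device into a report collection."""
        # $out replaces its target collection, so write to a date-stamped name
        # instead of overwriting the previous day's report.
        target = "daily_reports_" + day.strftime("%Y_%m_%d")
        tracks.aggregate([
            {"$match": {"ts": {"$gte": day, "$lt": day + timedelta(days=1)}}},
            {"$group": {
                "_id": "$device_id",
                "avg_speed": {"$avg": "$speed"},
                "max_speed": {"$max": "$speed"},
                "samples": {"$sum": 1},
            }},
            {"$out": target},
        ])

    nightly_rollup(datetime(2014, 3, 10))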

Upvotes: 1
