Ryan de Kock

Reputation: 235

storing huge amounts of data in mongo

I am working on a front end system for a radius server.

The radius server will pass updates to the system every 180 seconds, which means that with about 15,000 clients I would be looking at around 7,200,000 entries per day... which is a lot.

I am trying to understand what the best possible way to store and retrieve this data will be. Obviously, as time goes on, this will become substantial. Will MongoDB handle this? A typical document is not much, something like this:

{
  id: 1,
  radiusId: uniqueId,
  start: "2017-01-01 14:23:23",
  upload: 102323,
  download: 1231556
}
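
To make that concrete, inserting one of these from the mongo shell would look roughly like the following (the collection name "usage" and the values are just examples):

// Rough sketch only; collection name and values are illustrative.
db.usage.insertOne({
  radiusId: "client-0001",                  // unique id from the radius server
  start: ISODate("2017-01-01T14:23:23Z"),   // start of the 180-second window
  upload: 102323,                           // bytes uploaded in that window
  download: 1231556                         // bytes downloaded in that window
})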

However, there will be MANY of these records. I guess this is similar to the way SNMP NMS servers handle data, which, as far as I know, they do with RRD.

Currently, in my testing, I just push every document into a single collection. So I am asking:

A) Is Mongo the right tool for the job, and B) is there a better/more preferred/more optimal way to store the data?

EDIT:

OK, so just in case someone comes across this and needs some help.

I ran it for a while in Mongo and was really not satisfied with the performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge, or the framework I was using. However, I found a solution that works very well for me: InfluxDB (https://github.com/influxdata/influxdb) handles pretty much all of this right out of the box. It's a time-series database, which is effectively what I am trying to store. Performance for me has been like night & day. Again, it could all be my fault; just updating this.
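
For anyone wondering what that looks like in practice, writing one of these samples from Node with the node-influx client (the influx npm package) is roughly the following; the database and measurement names are just ones I made up for illustration:

// Rough sketch using node-influx; database/measurement names are illustrative.
const Influx = require('influx');

const influx = new Influx.InfluxDB({
  host: 'localhost',
  database: 'radius'
});

// One point per radius update: tags are indexed, fields hold the raw values.
influx.writePoints([
  {
    measurement: 'usage',
    tags: { radiusId: 'client-0001' },
    fields: { upload: 102323, download: 1231556 },
    timestamp: new Date('2017-01-01T14:23:23Z')
  }
]).catch(err => console.error('Error writing to InfluxDB', err));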

EDIT 2:

So after a while I think I figured out why I never got the performance I was after with Mongo. I am using Sails.js as the framework, and it was searching by id using a regex, which obviously has a huge performance hit. I will eventually try to migrate back to Mongo instead of Influx and see if it's better.
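
For reference, the difference boils down to this (mongo shell, names are just examples): a regex match can't seek the index the way an exact match can, so it ends up examining far more keys or documents.

// Mongo shell sketch; collection and field names are examples.
db.usage.createIndex({ radiusId: 1 })

// Roughly what the framework was generating: a regex match. Without an
// anchored, case-sensitive pattern this can't seek the index and ends up
// examining every key (or the whole collection), which is slow at scale.
db.usage.find({ radiusId: /client-0001/i }).explain("executionStats")

// What I actually want: an exact match, which is a plain index lookup.
db.usage.find({ radiusId: "client-0001" }).explain("executionStats")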

Upvotes: 0

Views: 3913

Answers (1)

Hashcut

Reputation: 843

15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
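
A quick sanity check of those numbers (and the ~7.2M/day figure from the question):

// Back-of-the-envelope check of the insert rate.
const clients = 15000;
const intervalSeconds = 180;

const insertsPerSecond = clients / intervalSeconds;        // ~83.3
const insertsPerDay = insertsPerSecond * 60 * 60 * 24;     // 7,200,000

console.log(insertsPerSecond.toFixed(1), insertsPerDay);   // 83.3 7200000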

I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:

  • Hardware: make sure you have enough RAM. How much will primarily depend on how many indexes you define and how many queries you're running. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries, then you'll need considerably more resources.
  • If you are running extensive queries, consider setting up a replica set. That way, your primary can be reserved for writing data, ensuring reliability, while your secondaries can be configured to serve your queries without affecting the writes.
  • Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
  • If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-). Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
  • As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data (see the sketch after this list for what that might look like). Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away, as I bet one reasonably provisioned server should be able to manage the load you've specified.
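
To make the index, replica-set read, and sharding points a bit more concrete, here's a rough mongo shell sketch. The database/collection names and the choice of a hashed shard key are only illustrative; the right shard key really depends on your query patterns.

// Rough sketch only; names are illustrative.

// Index to support lookups by client and time range.
db.usage.createIndex({ radiusId: 1, start: 1 })

// In a replica set, route reporting queries to secondaries so the primary
// stays free to absorb the write stream.
db.getMongo().setReadPref("secondaryPreferred")

// If one server stops keeping up, shard on the radius id so writes are
// spread across shards (run against a mongos in a sharded cluster).
sh.enableSharding("radius")
sh.shardCollection("radius.usage", { radiusId: "hashed" })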

Anyway, those are my preliminary suggestions.

Upvotes: 1
