Reputation: 21
Working on a very fire-and-forget type of application: a web-crawling app that collects thousands and thousands of items (often millions) from the internet and stores them in a NoSQL collection (currently MongoDB). These collections are very volatile, meaning they are created and dropped very rapidly. Data access is also very random: a collection might be created while the system is live and dropped while the system is live, and a collection created months ago will still be hit randomly for reads and updates. I'm talking thousands and thousands of collections with potentially millions of documents each.
To make a long story short, the issue is that MongoDB seems to perform poorly in this context. Its cache and the WiredTiger engine aren't designed to handle random access across collections, or the rapid creation and dropping of collections. Replication has become a nightmare, writes often stall, and the database gets incredibly backed up. Scaling my application to thousands and thousands of users appears to be a no-go with MongoDB, unfortunately.
So, with that said: does anyone know of or can recommend a database suited for this type of workload? We take advantage of geo indexes and full-text indexes, so those are basically the only hard requirements. I'm open to learning about and experimenting with anything, preferably a graph database, but performance and production readiness are key.
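For context, here is a minimal sketch of the kind of indexes we rely on today, using pymongo (the connection string, collection, and field names are made up for illustration):

```python
from pymongo import MongoClient, GEOSPHERE, TEXT

# Hypothetical connection and names, just to illustrate the requirement.
client = MongoClient("mongodb://localhost:27017")
items = client["crawler"]["items"]

# 2dsphere index for geo queries on a GeoJSON "location" field.
items.create_index([("location", GEOSPHERE)])

# Full-text index over the scraped title and body.
items.create_index([("title", TEXT), ("body", TEXT)])
```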
Upvotes: 0
Views: 541
Reputation: 14520
You don't say what specifically is problematic with your existing MongoDB deployment - "the database is getting backed up" is not an actionable problem report.
You also haven't mentioned sharding, which is probably the first recommendation that would be made for the type of workload you describe on MongoDB.
The impression I am getting is that you have maybe a single, huge replica set where you are doing heavy reads and writes all over the dataset AND doing DDL at the same time. I don't know which databases are designed for this type of workload, but my first reaction is to separate the dataset into smaller pieces.
What MongoDB offers, in part, is an extremely rich query language over the entire dataset and support for both transactional and analytical use cases. My impression is that many non-relational data stores (including my impression of Cassandra, though it dates back to 2010 or so and may not be current) do not support this spectrum of use cases. Sure, they may offer better performance, but at a much reduced feature set. So as an alternative I would consider sharding: it moves more of the effort from the database to the application, but you still get to keep MQL and, if you want them, ACID transactions.
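As a rough sketch of what that might look like (the database name, collection name, and shard key below are placeholders; the shard key in particular should match your actual access pattern), assuming you are connected through a mongos:

```python
from pymongo import MongoClient

# Hypothetical mongos address; sharding commands must go through mongos.
client = MongoClient("mongodb://localhost:27017")
admin = client.admin

# Enable sharding for the database, then shard the collection on a
# hashed _id so writes spread evenly across shards.
admin.command("enableSharding", "crawler")
admin.command("shardCollection", "crawler.items", key={"_id": "hashed"})
```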
I don't know how much tuning you've done - not to assume you haven't done enough, but the question you are asking here is basically "I have a 10 TB data set and I need a fast database for it". Given this level of detail, the most you'll probably get is a list of data stores.
Upvotes: 0
Reputation: 87
For a "fire and forget" use case I highly recommend Apache Cassandra, or even better ScyllaDB (to my understanding, Cassandra on steroids, rewritten from the ground up in C++ for best performance). You can google performance comparisons; both are outstanding in terms of write performance (not so great on read performance - note that I said "not so great", not bad or the worst).
Apache Cassandra is free for commercial use, so that's another green light to go with it. The syntax (CQL) is a lot like SQL (note I said a lot like, not SQL), so it's relatively easy to learn quickly. Besides, we've run it successfully on both GNU/Linux and Microsoft Windows server clusters.
Since ScyllaDB is modeled on Cassandra, it has pretty much the same syntax.
In our case, we've run Cassandra clusters for almost three years now and have migrated all our workflows and previous projects exclusively onto Apache Cassandra. I can only report good impressions regarding performance, although the most difficult thing at the beginning is understanding the basic concepts of how it works internally and Cassandra's "query first, then data model" way of thinking.
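To illustrate that "query first" mindset, here is a minimal sketch with the Python cassandra-driver (keyspace, table, and column names are invented for the example): you shape the table around the read you want, rather than around the entities.

```python
from cassandra.cluster import Cluster

# Hypothetical contact point; adjust for your own cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS crawler
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# "Query first": the dominant read is "give me the items for a crawl job,
# newest first", so job_id is the partition key and crawled_at orders rows.
session.execute("""
    CREATE TABLE IF NOT EXISTS crawler.items_by_job (
        job_id uuid,
        crawled_at timeuuid,
        url text,
        body text,
        PRIMARY KEY (job_id, crawled_at)
    ) WITH CLUSTERING ORDER BY (crawled_at DESC)
""")
```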
I hope this helps you a bit in your research quest.
Upvotes: 1