Sami Korhonen

Reputation: 1303

High-performance storage for messages

I have been looking for a high-performance file storage solution for persisting SOAP messages in a Java EE environment.

We are currently using a CLOB table on Oracle RDBMS, but it is very expensive to scale. While Oracle works well for storing the related metadata, it doesn't perform too well with the message content. An insert into a table with a CLOB performs roughly 1000% worse than one without it. (This was measured by comparing the performance of a VARCHAR2(4000) insert to a CLOB insert with in-row storage disabled for the CLOB.)

Persisting messages on the file system is one option, but I have serious doubts about how an average file system would perform when storing millions of files per day. Considering we have to keep those files for several months, it just doesn't sound right.
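One common way to keep any single directory from growing too large is to bucket files by date and by a hash prefix of the message ID. The following is only a minimal sketch of that idea, not a measured solution: the class name `MessagePaths`, the `.xml` suffix, and the fan-out of 256 buckets per day are all assumptions made for illustration.

```java
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps a message ID to a bucketed file path so that
// no single directory has to hold a whole day's worth of files.
final class MessagePaths {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    private MessagePaths() {}

    // Layout: <root>/<yyyy>/<MM>/<dd>/<bucket>/<messageId>.xml
    // The bucket is two hex digits (00..ff) taken from the ID's hash code.
    static Path pathFor(Path root, LocalDate day, String messageId) {
        int bucket = messageId.hashCode() & 0xff; // assumed fan-out of 256
        return root.resolve(day.format(DAY))
                   .resolve(String.format("%02x", bucket))
                   .resolve(messageId + ".xml");
    }
}
```

With a layout like this, a day's directory can also be archived or removed as a unit once the retention period has passed. Whether the fan-out is sufficient depends on the file system and the daily volume, so it would need to be measured.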

I know there are several open source key-value databases (Jackrabbit and MongoDB, to name a few) that might be up to the task, but I just can't find the time to evaluate them all. I would also like to hear about the performance of open source RDBMSs.

Considering that the volume of transmitted messages is ever increasing, the priority is on low latency and high performance. We do not require clustering or transactionality, and (minor) data loss on system failure is acceptable.

Requirements:

Help is appreciated.

Upvotes: 1

Views: 316

Answers (4)

Sami Korhonen

Reputation: 1303

This is what I've discovered so far. I will try to update this answer after evaluating each product.

I started my experiments using MongoDB, which on paper looked like a viable option. Here's a summary of my findings:

  • Written in C++
  • Replication (replicaset) requires 3 nodes for high availability
  • One of the nodes is elected as a master - only the master can write
  • Scaling out is done by sharding (partitioning)
  • Each shard is essentially a replicaset - therefore a sharded environment requires at least 6 nodes for high availability
  • A mongod instance consumes all available memory - virtualization should be used for resource partitioning (if you intend to run an application server on the same hardware)
  • Master re-election may take up to 1 minute
  • Document collections (tables) use exclusive lock during write operation
  • Java API is exceptionally easy to use and includes a virtual filesystem called GridFS
  • Single node write performance on the test system was ~20000 inserts/sec for a 1 KB document
  • Single node read performance was ~20000 reads/sec for a 1 KB document

The fact that MongoDB would require 6 nodes in a two-data-center configuration made me look further for more cost-efficient solutions.

Apache Cassandra:

  • Written in Java
  • Replication requires 3 nodes for high availability
  • Database survives network partitioning
  • Replication algorithm has been designed for multiple data centers
  • All nodes are writable
  • Scaling out can be done by adding more nodes (up to a certain limit)
  • Cassandra may require JVM garbage collection tuning
  • Java API is not the easiest to work with
  • Single node write performance was ~7000 inserts/sec for a 1 KB document
  • Single node read performance was ~7000 reads/sec for a 1 KB document

While Cassandra was slower in a single node configuration, its write performance in a high availability configuration would match MongoDB's. The ability to perform writes on every node (even during network partitioning) is a very welcome addition for logging.

Couchbase:

Unfortunately I was unable to test Couchbase.

For now we'll keep using Oracle SecureFiles. Should we run out of resources on Oracle, both Cassandra and MongoDB seem like viable alternatives.

Upvotes: 0

kalyan

Reputation: 46

Oracle 11g introduced a data deduplication feature. This feature will improve the performance of an Oracle database with CLOBs.

Upvotes: 0

Alexander Jardim

Reputation: 2476

You can try the following products:

  • HBase
  • MongoDB
  • Cassandra
  • Solr 4.0 (only)

These are the ones I have some experience with. There are a lot of other good products on the market that can do what you want.

Some observations: as far as I know, none of them has this "delete by age" feature out of the box. But it should be really simple to implement. I assume it would be easiest in MongoDB.
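To illustrate how small a "delete by age" job can be, here is a minimal, storage-agnostic sketch for messages kept as plain files. It does not use any of the products listed above; the class name `RetentionSweeper` and the use of the file's last-modified time as the message age are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical retention job: removes files older than a retention period.
final class RetentionSweeper {
    private RetentionSweeper() {}

    // True when lastModified is strictly earlier than (now - retention).
    static boolean isExpired(Instant lastModified, Instant now, Duration retention) {
        return lastModified.isBefore(now.minus(retention));
    }

    // Deletes expired regular files under root and returns how many were removed.
    static int sweep(Path root, Instant now, Duration retention) {
        try (Stream<Path> paths = Files.walk(root)) {
            // Collect first, then delete, so the tree is not modified mid-walk.
            List<Path> expired = paths
                .filter(Files::isRegularFile)
                .filter(p -> isExpired(lastModified(p), now, retention))
                .collect(Collectors.toList());
            for (Path p : expired) {
                Files.delete(p);
            }
            return expired.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A scheduled call such as `RetentionSweeper.sweep(root, Instant.now(), Duration.ofDays(90))` would then enforce a 90-day retention; the 90 days are only an example value. With a database, the equivalent would be a periodic delete on a timestamp column or field.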

If you try Solr, you should stick with the 4.x versions, as these are the only ones that support near-real-time commits, which will affect your "delete and insert" requirement.

All of them have great performance, but I have not run a benchmark against your requirements. If I were you, I would run my own benchmarks.
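A benchmark of this kind can start out very small. The sketch below measures inserts per second for fixed-size payloads against a pluggable sink. The `MessageSink` interface, the `InMemorySink` stand-in, and the `InsertBenchmark` class are all hypothetical names invented here; a real run would replace the in-memory sink with a thin wrapper around the client of the product being evaluated.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over whichever store is being evaluated.
interface MessageSink {
    void put(String key, byte[] payload);
}

// Stand-in sink so the harness runs without any database.
final class InMemorySink implements MessageSink {
    final Map<String, byte[]> data = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] payload) {
        data.put(key, payload);
    }
}

final class InsertBenchmark {
    private InsertBenchmark() {}

    // Builds a payload of exactly `size` bytes.
    static byte[] payload(int size) {
        byte[] bytes = new byte[size];
        Arrays.fill(bytes, (byte) 'x');
        return bytes;
    }

    // Inserts `count` messages of `payloadSize` bytes and returns inserts/sec.
    static double insertsPerSecond(MessageSink sink, int count, int payloadSize) {
        byte[] body = payload(payloadSize);
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            sink.put("msg-" + i, body);
        }
        long elapsed = Math.max(1L, System.nanoTime() - start); // avoid divide by zero
        return count * 1_000_000_000.0 / elapsed;
    }
}
```

For example, `InsertBenchmark.insertsPerSecond(new InMemorySink(), 100_000, 1024)` gives a rate for 1 KB messages. This is a single-threaded sketch without warm-up; for meaningful numbers it helps to run a warm-up pass first, use realistic message sizes, and test with the same concurrency the production system will see.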

Upvotes: 1

Petr Mensik

Reputation: 27516

Here is a nice comparison between MongoDB and SQL Server (I believe Oracle will have similar performance). You can see from the charts that Mongo can handle 20,000 inserts per second. Mongo also has a JSON-based query language that can do almost everything regular SQL can, and it has sharded clusters and replica sets, which can handle all necessary backups and failover (some basic info here).

Also, if you are interested in digging a little deeper, 10gen has an online course starting in two weeks that awards a certificate.

Upvotes: 1
