Philip K. Adetiloye

Reputation: 3270

Spark-Cassandra Vs Spark-Elasticsearch

I have been using Elasticsearch for quite some time now and have little experience with Cassandra.

Now I have a project where we want to use Spark to process the data, and I need to decide whether we should use Cassandra or Elasticsearch as the datastore to load my data from.

In terms of connectors, both Cassandra and Elasticsearch now have a good connector for loading data into Spark, so that won't be the deciding factor.

The deciding factor will be how fast I can load my data into Spark. My data is almost 20 terabytes.
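
For a rough sense of scale, load time can be bounded from the aggregate read throughput of the cluster. This is only a back-of-the-envelope sketch: the node count and per-node throughput below are hypothetical placeholders, not measurements of Cassandra or Elasticsearch, and the function name is made up for illustration.

```python
def estimated_load_hours(data_tb, nodes, mb_per_sec_per_node):
    """Rough lower bound on the time to scan data_tb terabytes in parallel.

    Assumes perfectly even partitioning and no connector overhead,
    so real loads will be slower.
    """
    total_mb = data_tb * 1_000_000  # 1 TB = 10^6 MB (decimal units)
    aggregate_mb_per_sec = nodes * mb_per_sec_per_node
    return total_mb / aggregate_mb_per_sec / 3600


# Placeholder inputs: 10 nodes, each sustaining 100 MB/s of reads.
hours = estimated_load_hours(20, nodes=10, mb_per_sec_per_node=100)
print(f"{hours:.1f} h")  # prints "5.6 h"
```

Plugging in throughput numbers measured on the actual clusters would turn this into a quick comparison between the two stores.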

I know I can run some tests using JMeter and see the results myself, but I would like to hear from anyone familiar with both systems.

Thanks

Upvotes: 2

Views: 1795

Answers (2)

azngunit81

Reputation: 1604

I will refute Evgenii's answer about how ES is only good at searching. Yes, ES excels at text search, but that doesn't mean it can't handle data.

You can actually treat it as a "Mongo"-style document store and run "filter" queries on it to fetch results quickly. However, the question then becomes: how fast do your reads and writes need to be, and do you need any distribution? What ES lacks is distribution. Yes, ES can do sharding, but it has issues with multi-region distribution and with the reliability of replicating your data.

If you need flexibility and reliability for your data, I would lean toward Cassandra. Also, since you are dealing with terabytes, Cassandra might be the winner because it is suited to extreme volume.

If you need an easier time running searches (not limited to text search; you can do geospatial queries too, for example), then ES might be a better fit. (Note that for the sheer volume you are handling, you will need to shard in order to distribute your load.)

Upvotes: 2

evgenii

Reputation: 1235

The short exact answer is "it depends", mostly on cluster sizes =)

I wouldn't choose Elasticsearch as a primary source for the data, because it's good at searching. Searching is a very specific task and it requires a very specific approach, which in this case uses an inverted index to store the actual data. Each field basically goes into a separate index, and because of that the indexes are very compact. Although it's possible to store complete objects in the index, such an index will hardly get any benefit from compression. That requires much more disk space to store the indexes and many more CPU cycles and disk reads to process them.
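
To illustrate the per-field inverted index idea, here is a toy sketch in plain Python. It is not how Elasticsearch or Lucene is actually implemented; the sample documents and the helper names `build_inverted_index` and `search` are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample documents, keyed by document id.
docs = {
    1: {"user": "alice", "city": "paris"},
    2: {"user": "bob", "city": "paris"},
    3: {"user": "alice", "city": "berlin"},
}


def build_inverted_index(docs):
    """Map field -> value -> set of ids of documents containing that value."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            index[field][value].add(doc_id)
    return index


def search(index, **terms):
    """Return the ids of documents matching every field=value term."""
    postings = [index[field][value] for field, value in terms.items()]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(docs)
print(search(index, user="alice", city="paris"))  # prints {1}
```

Note that the index only holds document ids per value. Answering a search is a cheap set intersection, but returning the full objects means either looking them up elsewhere or storing them alongside the index, which is the extra cost described above.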

Cassandra on the other hand is pretty good at storing and retrieving data.

Without more specific requirements, I'd say that Cassandra is good at being primary storage (and supports fairly simple search scenarios) and ES is good at searching.

Upvotes: 3
