zoo_live_crew

Reputation: 271

Using Spark in conjunction with Cassandra?

In our current infrastructure we use a Cassandra cluster as our backend database, and our customers perform read queries against it as needed through a web UI backed by Solr.

I've been asked to look into Spark as something that we could implement in the future, but I'm having trouble understanding how it will improve what we currently do.

So my basic questions are:

1) Is Spark something that would replace Solr for querying the database, like when a user is looking something up on our site?

2) Just as a general idea, what type of infrastructure would be necessary to improve our current situation (5 Cassandra nodes, all of which also run Solr)? In other words, would we simply be looking at building another cluster of just Spark nodes?

3) Can Spark nodes run on the same physical machines as Cassandra? I'm guessing that would be a bad idea due to memory constraints, since my very basic understanding of Spark is that it does everything in memory.

4) Any good quick/basic resources I can use to start figuring out how Spark might benefit us? I have access to DataStax Academy courses and am going through those; I'm just wondering if there is anything else to help with my research.

Basically, once I figure out what it is, and more importantly how/if it is something we can use to our advantage, I'll start playing with some test instances, but I should probably familiarize myself with the basics first.

Upvotes: 0

Views: 102

Answers (1)

RussS

Reputation: 16576

1) No. Spark is a batch processing system and Solr is a live indexing solution. Latency on Solr is going to be sub-second, while Spark jobs are meant to take minutes (or more). There should really be no situation where Spark can be a drop-in replacement for Solr.

2) I generally recommend a second Datacenter running both C* and Spark on the same machines. This second Datacenter will receive the data from the first via replication.
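For example, replicating a keyspace to both Datacenters could look roughly like this (a minimal sketch using the DataStax Java driver from Scala; the keyspace name, the Datacenter names "Solr" and "Analytics", the contact point, and the replication factors are all placeholders):

    import com.datastax.driver.core.Cluster

    // Connect to any node in the existing cluster (placeholder address).
    val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
    val session = cluster.connect()

    // Replicate the keyspace to both Datacenters so the analytics DC
    // gets a full copy of the data without adding load to the Solr DC.
    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS my_keyspace
        |WITH replication = {'class': 'NetworkTopologyStrategy',
        |                    'Solr': 3, 'Analytics': 3}""".stripMargin)

    session.close()
    cluster.close()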

3) Spark does not do everything in memory. Depending on your use case, it can be a great idea to run it on the same machines as C*. This allows for data locality when reading from C* and helps significantly with table scan times, which is why I usually recommend colocating Spark executors and C* nodes.
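As a rough sketch of what a colocated batch job looks like with the spark-cassandra-connector (keyspace, table, and contact host are placeholders; when an executor runs on a C* node, the connector prefers reading that node's token ranges locally):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("CassandraTableScan")
      // Point the connector at the analytics Datacenter (placeholder address).
      .set("spark.cassandra.connection.host", "10.0.0.1")
    val sc = new SparkContext(conf)

    // Full table scan: each executor reads the token ranges owned by the
    // C* node it is colocated with, which is where the locality win comes from.
    val rowCount = sc.cassandraTable("my_keyspace", "my_table").count()
    println(s"Scanned $rowCount rows")

    sc.stop()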

4) The DS Academy 320 course is probably the best resource out there at the moment: https://academy.datastax.com/courses/getting-started-apache-spark

Upvotes: 6
