Rakesh

Reputation: 103

Spark and Cassandra through Python

I have huge data stored in Cassandra and I want to process it with Spark through Python. I just want to know how to interconnect Spark and Cassandra through Python. I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data at once from Cassandra and then feeding it to Spark doesn't make sense. Any suggestions?

Upvotes: 3

Views: 1880

Answers (2)

Marko Švaljek

Reputation: 2101

I'll just give my "short" 2 cents. The official docs are totally fine to get started with. You might want to specify why it isn't working, i.e. did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error causing your example to fail? It would also be nice if you provided that example.

Here are some opinions/experiences that I've had. Usually (not always, but most of the time) you have multiple rows per partition. You don't always have to load all the data in a table; more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problems for me.
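As a rough sketch of that idea (the keyspace `test`, table `kv`, and partition-key column `key` are made-up names for illustration), you can let the connector push a partition-key filter down to Cassandra instead of loading the whole table:

```python
from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector package is on the classpath and
# that spark.cassandra.connection.host points at your cluster.
spark = SparkSession.builder \
    .appName("single-partition-read") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()

df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load()

# An equality filter on the partition key is pushed down to Cassandra,
# so only that one partition is fetched, not the whole table.
single_partition = df.filter(df.key == "some-key")
single_partition.show()
```

The important part is filtering on the partition key before any action; filters on non-key columns generally cannot be pushed down and force a wider scan.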

If you don't want the whole "store in Cassandra, fetch into Spark" cycle for your processing, there are really a lot of solutions out there. Basically that would be Quora material. Here are some of the more common ones:

  1. Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast or, even better, Akka Cluster; this is a really wide topic
  2. Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra
  3. Apache Flink - use a proper streaming solution and periodically flush the state of the process to e.g. Cassandra
  4. Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (just hard to say with the info you provided)
  5. The list could go on and on... user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
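To illustrate option 2, here is a minimal Structured Streaming sketch. The socket source and the `test.kv` table are placeholders, and it assumes Spark 2.4+ (for `foreachBatch`) plus the Cassandra connector on the classpath; `foreachBatch` lets you reuse the ordinary batch writer for each micro-batch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-cassandra").getOrCreate()

# Placeholder source: lines from a local socket. In practice this would
# more likely be Kafka or another durable source.
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is written with the regular batch connector.
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="kv", keyspace="test") \
        .mode("append") \
        .save()

query = lines.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```

The batch DataFrame's columns would of course have to match the target table's schema; that mapping depends entirely on your use case.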

It might also be a good idea to provide some details about your use case. More or less everything I said here is pretty general and vague, but then again, putting all of this into a comment just wouldn't make sense.

Upvotes: 0

RussS

Reputation: 16576

Have you tried the examples in the documentation?

Spark Cassandra Connector Python Documentation

    spark.read \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="kv", keyspace="test") \
        .load().show()
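For that to work, the connector has to be on the classpath. One common way is the `--packages` flag when launching `pyspark` or `spark-submit` (the version coordinates below are only an example - pick ones matching your Spark and Scala versions):

```shell
# Coordinates are illustrative; match the connector version to your
# Spark/Scala version and point connection.host at your Cassandra node.
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 \
        --conf spark.cassandra.connection.host=127.0.0.1
```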

Upvotes: 3
