Rafiul Sabbir

Reputation: 636

Spark and Cassandra in the same Docker container

I am extracting a tar file into JSON files and then saving those JSON files in Cassandra using Spark (2.4.0) and Cassandra (3.11). I am running PySpark inside a Docker container and Cassandra on my local machine (the host).

I have a bash script data_extractor.sh in the Docker image which contains:

$SPARK_HOME/bin/spark-submit --packages datastax:spark-cassandra-connector:2.4.0-s_2.11 data_extractor.py -f $1

and I am running the container using:

docker run -it spark-docker data_extractor.sh data.tar

When I run spark-submit, it completes the unzip task but can't connect to the Cassandra database (I am using cassandra-driver in PySpark and have started Cassandra with cassandra -f).

I am getting the following error:

Start building Cassandra schemas.
Traceback (most recent call last):
  File "/app/data_extractor.py", line 83, in <module>
    schema_builder = CassandraSchemaGenerator(keyspace)
  File "/app/cql_schema_creator.py", line 8, in __init__
    self.cluster_conn = self.cluster.connect()
  File "cassandra/cluster.py", line 1278, in cassandra.cluster.Cluster.connect
  File "cassandra/cluster.py", line 1314, in cassandra.cluster.Cluster.connect
  File "cassandra/cluster.py", line 1301, in cassandra.cluster.Cluster.connect
  File "cassandra/cluster.py", line 2802, in cassandra.cluster.ControlConnection.connect
  File "cassandra/cluster.py", line 2845, in cassandra.cluster.ControlConnection._reconnect_internal
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})

I am getting the error in the following code:

from cassandra.cluster import Cluster


class CassandraSchemaGenerator:
    def __init__(self, keyspace):
        self.keyspace = keyspace
        self.cluster = Cluster(['127.0.0.1'], port=9042)
        self.cluster_conn = self.cluster.connect()

This is my Dockerfile: https://pastebin.com/nSkxZ2Au

My questions are:

  1. How can I solve this issue if I run PySpark in Docker and Cassandra on my local machine?

  2. Is there any way to run both Spark and Cassandra in the same container without errors?

  3. Am I doing something wrong in the Python code and/or Cassandra settings? If so, how do I resolve it?

I have tried several Dockerfiles to achieve the second point but failed. Also, this is the first time I am using Spark and Cassandra, so consider me a noob.

Thanks.

Upvotes: 0

Views: 648

Answers (1)

debduttoc

Reputation: 314

Your Cassandra is running on the host and PySpark is running inside the container, so 127.0.0.1 inside the container is just a loopback to the container itself.

You need to access the host machine from inside the container.

The crudest way to do this is to use your host's IP instead of 127.0.0.1, but that is fragile because your host machine's IP can change.
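One way to avoid hard-coding the address (just a sketch; the CASSANDRA_HOST variable name is my own placeholder, not something from your code) is to read the contact point from an environment variable and pass it in at docker run time:

import os

from cassandra.cluster import Cluster


class CassandraSchemaGenerator:
    def __init__(self, keyspace):
        self.keyspace = keyspace
        # Contact point comes from the environment; falls back to localhost
        cassandra_host = os.environ.get('CASSANDRA_HOST', '127.0.0.1')
        self.cluster = Cluster([cassandra_host], port=9042)
        self.cluster_conn = self.cluster.connect()

Then start the container with your host's IP, e.g. docker run -e CASSANDRA_HOST=<your-host-ip> -it spark-docker data_extractor.sh data.tar.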

If you're on Docker for Mac, you can use docker.for.mac.localhost instead of 127.0.0.1.
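With the environment-variable sketch above, that would look something like:

docker run -e CASSANDRA_HOST=docker.for.mac.localhost -it spark-docker data_extractor.sh data.tar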

But the ideal way would be to run Cassandra and PySpark in two separate containers and connect them on the same Docker network.

Please read the following to find out how: https://docs.docker.com/v17.09/engine/userguide/networking/#default-networks
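A rough sketch of that setup (the network and container names below are just placeholders, and it assumes the environment-variable change shown earlier):

# Create a user-defined bridge network
docker network create spark-cassandra-net

# Run the official Cassandra image on that network; other containers can reach it by name
docker run -d --name cassandra --network spark-cassandra-net cassandra:3.11

# Run your Spark image on the same network, pointing the driver at the "cassandra" hostname
docker run -it --network spark-cassandra-net -e CASSANDRA_HOST=cassandra spark-docker data_extractor.sh data.tar

This way the Cassandra container has a stable hostname on the network, and nothing depends on the host machine's IP.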

Upvotes: 1
