Reputation: 636
I am extracting a tar file containing JSON files and then saving those JSON files to Cassandra using Spark (2.4.0) and Cassandra (3.11). I am running PySpark in a Docker container and have Cassandra running locally on my host.
I have a bash script data_extractor.sh in the Docker image which contains
$SPARK_HOME/bin/spark-submit --packages datastax:spark-cassandra-connector:2.4.0-s_2.11 data_extractor.py -f $1
and I am running docker using
docker run -it spark-docker data_extractor.sh data.tar
When I run spark-submit, it does the unzip task but can't connect to the Cassandra database (I am using cassandra-driver in PySpark and have started Cassandra with cassandra -f).
I am getting the following error:
Start building Cassandra schemas.
Traceback (most recent call last):
File "/app/data_extractor.py", line 83, in <module>
schema_builder = CassandraSchemaGenerator(keyspace)
File "/app/cql_schema_creator.py", line 8, in __init__
self.cluster_conn = self.cluster.connect()
File "cassandra/cluster.py", line 1278, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 1314, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 1301, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 2802, in cassandra.cluster.ControlConnection.connect
File "cassandra/cluster.py", line 2845, in cassandra.cluster.ControlConnection._reconnect_internal
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
I am getting error in following code:
from cassandra.cluster import Cluster

class CassandraSchemaGenerator:
    def __init__(self, keyspace):
        self.keyspace = keyspace
        self.cluster = Cluster(['127.0.0.1'], port=9042)
        self.cluster_conn = self.cluster.connect()
This is my docker file: https://pastebin.com/nSkxZ2Au
My questions are:
How can I solve this issue if I run PySpark in Docker and Cassandra locally on the host?
Is there any way to run both spark and cassandra in the same container and run them without error?
Am I doing something wrong in python code and/or cassandra settings? If yes, how to resolve that?
I have tried several Dockerfiles to achieve the 2nd point but failed. Also, this is the first time I am using Spark and Cassandra, so consider me a noob.
Thanks.
Upvotes: 0
Views: 648
Reputation: 314
Since Cassandra is running on the host and PySpark is running inside the container, 127.0.0.1 inside the container is just a loopback to the container itself.
You need to access the host machine from inside the container.
The crudest way to do this is to use your host's IP instead of 127.0.0.1. But this causes problems because your host machine's IP might change over time.
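One way to keep a changing host IP out of the code is to read the Cassandra host from an environment variable and fall back to 127.0.0.1. This is just a sketch, not the question's actual code: the helper name and the CASSANDRA_HOST variable are hypothetical.

```python
import os

def cassandra_contact_points():
    # Hypothetical helper: read the Cassandra host from an environment
    # variable so the same image works whether Cassandra runs on the host
    # (pass the host's IP), on Docker for Mac (docker.for.mac.localhost),
    # or in another container (its container name on a shared network).
    return [os.environ.get("CASSANDRA_HOST", "127.0.0.1")]

# In CassandraSchemaGenerator.__init__ this would become:
#   self.cluster = Cluster(cassandra_contact_points(), port=9042)
```

The host could then be supplied at run time, e.g. docker run -e CASSANDRA_HOST=192.168.1.10 ..., without rebuilding the image.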
If you're on Docker for Mac, you can use docker.for.mac.localhost instead of 127.0.0.1. (Newer Docker Desktop releases call this host.docker.internal.)
But the ideal way would be to run Cassandra and PySpark in two separate containers and connect them over the same Docker network.
Please read the following to find out how: https://docs.docker.com/v17.09/engine/userguide/networking/#default-networks
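The two-container setup could look roughly like this (a sketch: the network name spark-net and the container name cassandra are made up, and I'm assuming your image tag is spark-docker as in the question):

```shell
# Create a user-defined bridge network; containers on it can reach
# each other by container name.
docker network create spark-net

# Run Cassandra on that network under the name "cassandra".
docker run -d --name cassandra --network spark-net cassandra:3.11

# Run the Spark job on the same network. Inside the container the
# Python code would then connect with Cluster(['cassandra'], port=9042)
# instead of 127.0.0.1.
docker run -it --network spark-net spark-docker data_extractor.sh data.tar
```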
Upvotes: 1