TheDataJanit0r

Reputation: 31

Connecting to a local Docker Spark Cluster

I am trying to connect from my laptop to a Spark cluster that I created locally. The docker-compose file I used is the following:


services:
  spark-master:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
  spark-worker:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-2:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-3:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

The compose file above runs a Bitnami image with three workers and one master, and the code I am using to connect from my Jupyter notebook is the following:

import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Day1_1").master("spark://localhost:7077").getOrCreate()
df_NYTaxi = spark.read.csv(file)  # file holds the CSV path, defined earlier in the notebook

The error I get after running the above code is the following:

: java.lang.NullPointerException
    at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:78)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:518)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:596)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
I have tried a lot of things, but every time I either can't connect to the Docker cluster at all, or I can connect but the job times out.
My local Spark version is 3.2.1, and the image uses the same version.

Upvotes: 0

Views: 2162

Answers (1)

TheDataJanit0r

Reputation: 31

The workaround was to build a custom Docker image, run the master and workers as multiple containers from it, and then attach to the master container through VS Code and run the scripts from inside the cluster's own network.

Here is the docker-compose file after modification:

version: '2'

services:
  spark:
    build : .
    container_name: spark_master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"

    
  spark-worker:
    build : .
    container_name: spark_worker_1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
 
  spark-worker-2:
    build : .
    container_name: spark_worker_2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"

And the Dockerfile for building this image is the following:

FROM bitnami/spark:3.2.1
USER root

# Install packages into Spark if needed, e.g.:
# spark-shell --master local --packages "<package name>"
RUN pip install findspark

EXPOSE 8080
EXPOSE 7075
EXPOSE 7077

After building and starting this image (for example with docker-compose up --build; note that you need to create two folders called execution_scripts and resources for the volume mounts), you can attach to the running master container in VS Code, or in a similar way from any other IDE, and run your scripts from there.
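
For reference, a minimal PySpark script run from inside the attached master container could look like the sketch below. The script path, the CSV file name under /resources, and the spark-submit invocation are illustrative assumptions rather than part of the original setup; the key point is that inside the Compose network the master is reachable as spark://spark:7077 (the service name), not localhost.

# /execution_scripts/read_taxi.py -- hypothetical example script
# Run it from inside the master container, e.g.:
#   spark-submit --master spark://spark:7077 /execution_scripts/read_taxi.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Day1_1")
    # "spark" is the Compose service name of the master, resolvable
    # from any container on the same Compose network
    .master("spark://spark:7077")
    .getOrCreate()
)

# /resources is one of the volumes mounted in the compose file;
# the CSV name is a placeholder
df_NYTaxi = spark.read.csv("/resources/ny_taxi.csv", header=True)
df_NYTaxi.show(5)

spark.stop()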

Upvotes: 3
