Reputation: 1190
My Mesos-Spark cluster:
Executors are crashing every time I try to do a .count() after a join; the count without the join works perfectly. Not sure why, but in the failed queries I see:
And in the executor logs:
I don't see a specific OOM issue, so what's the deal here? It only seems to happen when the join is performed.
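For reference, the failing pattern is essentially the following (collection and column names here are placeholders, not my real ones; the real job reads from MongoDB via the connector):

```python
# Hypothetical reproduction of the failing pattern (names are made up).
df_a = spark.read.format("mongo").option("collection", "a").load()
df_b = spark.read.format("mongo").option("collection", "b").load()

df_a.count()                         # works fine
df_a.join(df_b, "some_key").count()  # executors die during the shuffle
```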
Followed @busfighter's suggestions: persisted the DataFrames with StorageLevel.MEMORY_ONLY before joining and reduced the number of partitions using coalesce(). Still the same error.
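Concretely, this is a sketch of what I tried (df1, df2, the join key, and the partition count are placeholders for my real values):

```python
from pyspark import StorageLevel

# Persist both sides in memory before the join, as suggested
df1 = df1.persist(StorageLevel.MEMORY_ONLY)
df2 = df2.persist(StorageLevel.MEMORY_ONLY)

# Reduce the partition count without triggering a full shuffle
df1 = df1.coalesce(10)
df2 = df2.coalesce(10)

df1.join(df2, "some_key").count()  # still fails with the same error
```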
Edit 1
Tried all comments, nothing:
Edit 2
The executors are never removed when the task fails; they just time out on the shuffle:
Edit 3
Note that the data size is really small when it crashes. I'm feeling lost and can't find the executor logs to check whether it was killed because of OOM:
Edit 4
Some important notes:
Config used on PySpark
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setAppName('daily_etl')
        .setMaster(XXXXX)
        .set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
        .set('spark.mesos.executor.home', '/opt/spark')
        )

spark = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()
Edit 5
Screenshot of the error:
Edit 6
Adding a screenshot of the Mesos UI:
Edit 7
Managed to narrow down the problem: for some reason the BlockManager is listening on localhost, so the other executors cannot connect:
Not sure why, but I will create another question for it.
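For anyone hitting the same symptom, a likely fix (an assumption on my part, not yet verified) is to tell the driver which address to advertise so the BlockManager does not register as localhost; "driver-hostname" below is a placeholder for an address the executors can actually resolve:

```python
from pyspark import SparkConf

# Hypothetical fix sketch: make the driver advertise a reachable address.
conf = (SparkConf()
        .setAppName('daily_etl')
        .set('spark.driver.host', 'driver-hostname')  # address advertised to executors
        .set('spark.driver.bindAddress', '0.0.0.0')   # listen on all interfaces
        )
```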
Upvotes: 1
Views: 1117
Reputation: 10213
Please try this:
conf = (SparkConf()
        .setAppName('daily_etl')
        .setMaster(XXXXX)
        .set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
        .set("spark.mesos.executor.home", "/opt/spark")
        .set("spark.driver.memory", "16G")
        .set("spark.executor.memory", "8G")
        .set("spark.sql.autoBroadcastJoinThreshold", "-1")
        )
Maybe also do a repartition:
df = df.repartition(2000)
The right value depends on your cluster.
Upvotes: 1