Rajendra Singh

Reputation: 107

Unable to Connect Cassandra in EMR with bundle.zip with cluster mode

I am trying to connect to Astra Cassandra from AWS EMR, but the executors are not able to get the bundle file since I am passing it through S3.

This is the spark-submit command I am passing:

--master yarn
--class com.proj.prog
--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0,org.apache.hadoop:hadoop-aws:3.1.2
--conf spark.files=s3://.../connect/secure-connect-proj.zip
--conf spark.cassandra.connection.config.cloud.path=secure-connect-proj.zip

The deploy mode is cluster; it works in client mode but not in cluster mode.

I also tried the following, but it did not work either:

--conf spark.cassandra.connection.config.cloud.path=s3://.../connect/secure-connect-proj.zip

This was the error in both cases.

diagnostics: User class threw exception: java.io.IOException: \
  Failed to open native connection to Cassandra \
  at Cloud File Based Config at secure-connect-proj.zip :: \
    The provided path secure-connect-proj.zip is not a valid URL \
    nor an existing locally path. Provide an URL accessible to all executors \
    or a path existing on all executors (you may use `spark.files` \
    to distribute a file to each executor).
Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,  \
  most recent failure: Lost task 0.3 in stage 1.0 (TID 7) \
  (ip-172-31-17-85.ap-south-1.compute.internal executor 1): \
  java.io.IOException: Failed to open native connection to Cassandra \
  at Cloud File Based Config at s3://.../connect/secure-connect-proj.zip :: \
    The provided path s3://.../connect/secure-connect-proj.zip is not a valid URL \
    nor an existing locally path. Provide an URL accessible to all executors \
    or a path existing on all executors (you may use `spark.files` \
    to distribute a file to each executor).

Please help. I know I am missing something, but I could not find a working solution.

Upvotes: 4

Views: 570

Answers (1)

Erick Ramirez

Reputation: 16393

S3 URI

It's not clear from the examples you provided whether you have specified the correct S3 URI. Make sure that the URI is in one of the following forms:

s3://bucket_name/secure-connect-db_name.zip
s3://bucket_name/subdir/secure-connect-db_name.zip
s3://bucket_name/path/to/secure-connect-db_name.zip

I would suggest you update your original question and replace s3://... with s3://bucket_name to avoid confusion.
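
For reference, here is a minimal sketch of a cluster-mode spark-submit that distributes the bundle with spark.files and then references it by file name only. The bucket, bundle name, main class, and JAR name below are placeholders of my own, not values from your setup:

# ship the bundle from S3 to the working directory of the driver
# and every executor, then point the connector at the local copy
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MainClass \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 \
  --conf spark.files=s3://bucket_name/secure-connect-db_name.zip \
  --conf spark.cassandra.connection.config.cloud.path=secure-connect-db_name.zip \
  application.jar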

IAM roles and EMR

EMR uses EMRFS to access S3 data so you need to configure IAM roles for EMRFS requests. EMRFS uses permission policies attached to the service role for EC2 instances.

If it isn't configured correctly, this could be the reason EMR can't access the secure bundle. For details, see Configure IAM roles for EMRFS requests to Amazon S3.
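
As a quick sanity check, assuming you can SSH to a cluster node (bucket_name and the bundle name here are placeholders), try reading the bundle with the node's instance-profile credentials using the AWS CLI:

# list the bucket and copy the bundle locally; a 403 here points
# to a missing s3:GetObject permission on the EMRFS role
aws s3 ls s3://bucket_name/
aws s3 cp s3://bucket_name/secure-connect-db_name.zip /tmp/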

Compatibility

Make sure that you're using the correct version of the spark-cassandra-connector. Version 3.1 of the connector works with Spark 3.1, which means it will only work with Amazon EMR 6.3.

If you're using Amazon EMR 5.33, it has Spark 2.4 so you'll need to use version 2.5 of the connector.
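
If in doubt, you can confirm which Spark version your EMR release actually ships by running this on the master node (assuming SSH access):

# prints the Spark version banner, e.g. "version 3.1.2"
spark-submit --version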

Test with spark-shell

Test connectivity by running spark-shell so it's easier to isolate the problem.

These are the required dependencies to run the test:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.1.0"

Start the spark-shell with:

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 \
  --master {master-url} \
  --conf spark.files=s3://bucket_name/secure-connect-db_name.zip \
  --conf spark.cassandra.connection.config.cloud.path=secure-connect-db_name.zip \
  --conf spark.cassandra.auth.username=client_id \
  --conf spark.cassandra.auth.password=client_secret \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions

Finally, test the connection with:

import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table_name", "keyspace_name").load
data.printSchema
data.show
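
If the read succeeds, a minimal write test follows the same pattern (table_name and keyspace_name are placeholders, and the table must already exist in Astra):

// append the rows just read back into the same table
import org.apache.spark.sql.cassandra._
data.write.cassandraFormat("table_name", "keyspace_name").mode("append").save()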

Upvotes: 1
