rishab ajain445

Reputation: 187

PySpark job fails with "SSLPeerUnverifiedException" when reading from S3 despite adding Hadoop AWS dependencies

I am encountering an issue while running a PySpark job in a Docker container to read data from an S3 bucket. Despite adding the required JAR dependencies for Hadoop AWS integration, the job fails with a missing-dependencies error. Below are my Dockerfile and PySpark script:

FROM python:3.9-slim-buster
USER root
WORKDIR /app
RUN apt-get update && apt-get install -y openjdk-11-jre-headless wget && \
    rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir pyspark==3.3.1
RUN mkdir /opt/jars/


RUN wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar -P /opt/jars/
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar -P /opt/jars/
RUN wget https://repo1.maven.org/maven2/joda-time/joda-time/2.10.14/joda-time-2.10.14.jar -P /opt/jars/
RUN wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar -P /opt/jars/
RUN wget https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.13.4/jackson-core-2.13.4.jar -P /opt/jars/
RUN wget https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.13.4/jackson-annotations-2.13.4.jar -P /opt/jars/
RUN wget https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.13.4.2/jackson-databind-2.13.4.2.jar -P /opt/jars/


RUN mkdir -p /opt/spark/conf && touch /opt/spark/conf/hadoop-metrics2-s3a-file-system.properties
COPY query_s3_data.py .

#CMD ["spark-submit", "--jars", "/opt/jars/aws-java-sdk-bundle-1.12.678.jar,/opt/jars/hadoop-aws-3.3.2.jar", "query_s3_data.py"]
CMD ["spark-submit", "--jars", "/opt/jars/aws-java-sdk-bundle-1.11.901.jar,/opt/jars/hadoop-aws-3.3.1.jar,/opt/jars/joda-time-2.10.14.jar,/opt/jars/httpclient-4.5.13.jar,/opt/jars/jackson-core-2.13.4.jar,/opt/jars/jackson-annotations-2.13.4.jar,/opt/jars/jackson-databind-2.13.4.2.jar", "query_s3_data.py"]

pyspark script:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("S3Query") \
    .config("spark.jars", "/opt/jars/aws-java-sdk-bundle-1.12.678.jar,/opt/jars/hadoop-aws-3.3.2.jar") \
    .getOrCreate()

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret.key>")

spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.csv("s3a://sample.datasets/random.csv")
df.show()

error message:

24/03/19 05:27:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 6de12a7eaf0e, 38797, None) 24/03/19 05:27:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir. 24/03/19 05:27:34 INFO SharedState: Warehouse path is 'file:/app/spark-warehouse'. 24/03/19 05:27:35 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties 24/03/19 05:27:35 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 24/03/19 05:27:35 INFO MetricsSystemImpl: s3a-file-system metrics system started

Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for <sample.datasets.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) at com.amazonaws.http.conn.$Proxy34.connect(Unknown Source) at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ... 41 more Traceback (most recent call last): File "/app/query_s3_data.py", line 14, in <module> df = spark.read.csv("s3a://sample.datasets/good-datasets/Cabs.csv") File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 535, in csv File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
: org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://sample.datasets/good-datasets/Cabs.csv: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <sample.datasets.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <sample.datasets.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3289) at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760) at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4263) at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:784) at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:782) at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372) at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426) at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <sample.datasets.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5227) at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:6189) at 
com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:6162) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5211) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5173) at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1360) at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$6(S3AFileSystem.java:2066) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:412) at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:375) at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2056) at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2032) at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3273) ... 20 more Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for <sample.datasets.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) at jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) at com.amazonaws.http.conn.$Proxy34.connect(Unknown Source) at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ... 39 more

24/03/18 14:07:27 INFO SparkContext: Invoking stop() from shutdown hook 24/03/18 14:07:27 INFO SparkUI: Stopped Spark web UI at http://365675a28f3d:4040 24/03/18 14:07:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 24/03/18 14:07:27 INFO MemoryStore: MemoryStore cleared 24/03/18 14:07:27 INFO BlockManager: BlockManager stopped 24/03/18 14:07:27 INFO BlockManagerMaster: BlockManagerMaster stopped 24/03/18 14:07:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 24/03/18 14:07:27 INFO SparkContext: Successfully stopped SparkContext 24/03/18 14:07:27 INFO ShutdownHookManager: Shutdown hook called 24/03/18 14:07:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-5c7bb4af-4a35-46f8-9248-45cfebb6577f/pyspark-5077c08f-89f8-4ba2-bfc5-b5940c64356b 24/03/18 14:07:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-4000837e-8a83-449d-90d9-3f97b43247f5 24/03/18 14:07:27 INFO ShutdownHookManager: Deleting directory /tmp/spark-5c7bb4af-4a35-46f8-9248-45cfebb6577f 24/03/18 14:07:27 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system... 24/03/18 14:07:27 INFO MetricsSystemImpl: s3a-file-system metrics system stopped. 24/03/18 14:07:27 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

Despite adding the required JARs for Hadoop AWS integration, the job fails with a missing-dependencies error. I have referred to numerous Stack Overflow articles and consulted the AWS support team, who suggested adding the compatible dependencies listed in the official Hadoop documentation for the S3A client.

Required Dependencies:

- hadoop-aws jar
- aws-java-sdk-s3 jar
- aws-java-sdk-core jar
- aws-java-sdk-kms jar
- joda-time jar (version 2.8.1 or later)
- httpclient jar
- Jackson jackson-core, jackson-annotations, and jackson-databind jars

I have tried different combinations of JAR versions but still face the same issue. Any insights on resolving this issue would be greatly appreciated.

I've referred to a bunch of Stack Overflow articles as well, but nothing seems to work. One of the suggestions was to add an empty file to the classpath (https://stackoverflow.com/a/74182399/14879635), but that didn't work either. Any help would be appreciated.

Upvotes: 0

Views: 284

Answers (1)

stevel

Reputation: 13430

Not a JAR problem. It is an HTTPS certificate matching problem: sample.datasets.s3.amazonaws.com doesn't match the wildcard *.s3.amazonaws.com.

As the AWS S3 Docs say,

For best compatibility, we recommend that you avoid using dots (.) in bucket names, except for buckets that are used only for static website hosting. If you include dots in a bucket's name, you can't use virtual-host-style addressing over HTTPS, unless you perform your own certificate validation. This is because the security certificates used for virtual hosting of buckets don't work for buckets with dots in their names.

S3AFS uses the hostname alone; it is never going to be changed, as it is a complex issue and, as you can see from the certificate problem, insufficient to address your needs. See SPARK-32766, "s3a: bucket names with dots cannot be used".
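The underlying rule is that a TLS wildcard like *.s3.amazonaws.com covers exactly one DNS label, so a bucket name containing a dot adds an extra label and falls outside the certificate. A toy illustration of that single-label matching rule (simplified; real verification follows RFC 6125):

# Toy illustration of single-label TLS wildcard matching:
# "*" stands in for exactly one DNS label, never more.
def wildcard_matches(pattern: str, hostname: str) -> bool:
    p_labels = pattern.split(".")
    h_labels = hostname.split(".")
    if len(p_labels) != len(h_labels):
        return False  # a wildcard never spans multiple labels
    return all(p == "*" or p == h for p, h in zip(p_labels, h_labels))

print(wildcard_matches("*.s3.amazonaws.com", "sample-datasets.s3.amazonaws.com"))  # True
print(wildcard_matches("*.s3.amazonaws.com", "sample.datasets.s3.amazonaws.com"))  # False: one label too many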

Fix: use a supported bucket name for your data.
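For example, a minimal sketch of the same read, assuming the data is moved to a hypothetical dot-free bucket named sample-datasets, whose hostname sample-datasets.s3.amazonaws.com stays within the *.s3.amazonaws.com wildcard:

from pyspark.sql import SparkSession

# Minimal sketch: same S3A read as in the question, but against a hypothetical
# dot-free bucket name ("sample-datasets"), so virtual-host-style HTTPS
# addressing matches the certificate.
spark = SparkSession.builder \
    .appName("S3Query") \
    .getOrCreate()

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<key>")
hadoop_conf.set("fs.s3a.secret.key", "<secret.key>")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.csv("s3a://sample-datasets/random.csv")
df.show()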

Upvotes: 1
