Reputation: 1745
I am running a PySpark job through Airflow on a Google Dataproc cluster.
The job downloads data from AWS S3 and, after processing, stores it on Google Cloud Storage. So that the executors can access the S3 bucket from Dataproc, I store the AWS credentials in environment variables (appended to /etc/environment) while creating the Dataproc cluster through an initialization action.
Using boto3, I fetch the credentials and then set the Spark configuration:
import boto3

# boto3 resolves credentials from its standard chain (environment variables, shared config files, etc.)
boto3_session = boto3.Session()
aws_credentials = boto3_session.get_credentials()
aws_credentials = aws_credentials.get_frozen_credentials()
aws_access_key = aws_credentials.access_key
aws_secret_key = aws_credentials.secret_key

# Hand the keys to the Hadoop S3A connector used by Spark
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key)
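For reference, the same keys can also be supplied as Spark properties when the session is built, since anything prefixed with spark.hadoop. is copied into the Hadoop configuration. A minimal sketch (assuming a standard SparkSession build; the app name is made up and not from the original job):

from pyspark.sql import SparkSession

# Equivalent to calling hadoopConfiguration().set(...) after the session exists;
# aws_access_key / aws_secret_key are the values fetched via boto3 above.
spark = (
    SparkSession.builder
    .appName("s3-to-gcs")
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key)
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_key)
    .getOrCreate()
)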
Initialization Action file:
#!/usr/bin/env bash
# This script installs the required packages and configures the environment

# Install pip and the Python dependencies used by the job
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install boto3
sudo pip install google-cloud-storage

# Append the AWS credentials to the system-wide environment file
echo "AWS_ACCESS_KEY_ID=XXXXXXXXXXXXX" | sudo tee --append /etc/environment
echo "AWS_SECRET_ACCESS_KEY=xXXxXXXXXX" | sudo tee --append /etc/environment
source /etc/environment
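boto3 is expected to pick these values up from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. A quick way to check whether they are actually visible to the driver process (a debugging sketch, not part of the original job):

import os

# If either of these prints None, the variables appended to /etc/environment
# never reached the process that runs the PySpark driver.
print(os.environ.get("AWS_ACCESS_KEY_ID"))
print(os.environ.get("AWS_SECRET_ACCESS_KEY"))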
But I am getting the following error, which means my Spark process is unable to fetch the credentials from the environment variables:
18/07/19 22:02:16 INFO org.spark_project.jetty.util.log: Logging initialized @2351ms
18/07/19 22:02:16 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/07/19 22:02:16 INFO org.spark_project.jetty.server.Server: Started @2454ms
18/07/19 22:02:16 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@75b67e54{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/07/19 22:02:16 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.7-hadoop2
18/07/19 22:02:17 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.164.0.2:8032
18/07/19 22:02:19 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1532036330220_0004
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:23 INFO DependencyResolver: ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-1-m:9083
18/07/19 22:02:23 INFO hive.metastore: Connected to metastore.
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/952f73b3-a59c-4a23-a04a-f05dc4e67d89_resources
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89/_tmp_space.db
18/07/19 22:02:24 INFO DependencyResolver: ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/59c3fef5-6c9e-49c9-bf31-69634430e4e6_resources
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6/_tmp_space.db
Traceback (most recent call last):
File "/tmp/get_search_query_logs_620dea04/download_elk_event_logs.py", line 237, in <module>
aws_credentials = aws_credentials.get_frozen_credentials()
AttributeError: 'NoneType' object has no attribute 'get_frozen_credentials'
18/07/19 22:02:24 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@75b67e54{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
However, when I log in to the Dataproc node and submit the job manually, the Spark job fetches the credentials and runs fine.
Can someone help me with this?
Upvotes: 0
Views: 1796
Reputation: 1745
After playing with the Linux environment for some time, I could not get boto3 to pick up the AWS credentials from the environment variables. So I followed the boto3 documentation and modified the initialization action script as follows:
echo "[default]" | sudo tee --append /root/.aws/config
echo "aws_access_key_id = XXXXXXXX" | sudo tee --append /root/.aws/config
echo "aws_secret_access_key = xxxxxxxx" | sudo tee --append /root/.aws/config
There are several ways for a boto3 session to resolve AWS credentials, and one of them is the shared config file at ~/.aws/config.
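With that file in place, the boto3 code from the question works unchanged, because boto3 falls back to the shared config file when no environment variables are set. A minimal sketch of the same lookup (assuming the job runs as root, so the file written by the initialization action is the one picked up):

import boto3

# Credentials are now resolved from /root/.aws/config rather than environment variables.
session = boto3.Session()
credentials = session.get_credentials().get_frozen_credentials()
print(credentials.access_key)  # no longer None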
Upvotes: 2