Reputation: 41
I have created a Glue Dev Endpoint to test my code before deploying to AWS Glue. The project layout has a gluelibrary/ package that contains config.ini. I am able to successfully debug the code and have it run to completion. The way that I am calling the library in the dev environment looks like this:
import sys
import os
import time
from configobj import ConfigObj
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
config = ConfigObj('/home/glue/scripts/gluelibrary/config.ini')
This process successfully finds all of the variables I defined in the config file and exits with exit code 0.
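For reference, here is roughly how the config values get consumed; the section and key names below are illustrative, not my actual config:
from configobj import ConfigObj

# Hypothetical config.ini:
# [s3]
# bucket = my-bucket
# prefix = data/
config = ConfigObj('/home/glue/scripts/gluelibrary/config.ini')
bucket = config['s3']['bucket']  # sections act like nested dicts
prefix = config['s3']['prefix']  # values come back as strings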
Note: the library that I developed was zipped and added to the S3 bucket where I told the Glue job to look for the .zip.
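For context, a minimal sketch of how such a zip can be built and uploaded; the local paths are illustrative, while the bucket and key match the --extra-py-files value shown in the job arguments below:
import os
import zipfile
import boto3

# Archive the package so gluelibrary/ sits at the root of the zip,
# which is what --extra-py-files expects for importable packages
with zipfile.ZipFile('gluelibrary.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk('gluelibrary'):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, arcname=path)

boto3.client('s3').upload_file('gluelibrary.zip', 'BUCKET_NAME', 'Python/gluelibrary.zip')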
However, when I try to run the same code (with the exception of the file path) in the Glue console, I get an error:
import sys
import os
import time
from configobj import ConfigObj
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
from gluelibrary.helpers import get_date
from gluelibrary.boto3_.s3_utils import delete_data_in_sub_directories, check_for_empty_bucket
from gluelibrary.boto3_.s3_utils import replace_data_in_sub_directories, check_bucket_existence
print('starting job.')
print(os.getcwd())
config = ConfigObj('/home/glue/gluelibrary/config.ini')
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000
--conf spark.hadoop.fs.defaultFS=hdfs://IP_ADDRESS.internal:8020
--conf spark.hadoop.yarn.resourcemanager.address=IP_ADDRESS.internal:8032
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=18
--conf spark.executor.memory=5g
--conf spark.executor.cores=4
--JOB_ID j_26c2ab188a2d8b7567006809c549f5894333cd38f191f58ae1f2258475ed03d1
--enable-metrics
--extra-py-files s3://BUCKET_NAME/Python/gluelibrary.zip
--JOB_RUN_ID jr_0292d34a8b82dad6872f5ee0cae5b3e6d0b1fbc503dca8a62993ea0f3b38a2ae
--scriptLocation s3://BUCKET_NAME/admin/JOB_NAME
--job-bookmark-option job-bookmark-enable
--job-language python
--TempDir s3://BUCKET_NAME/admin
--JOB_NAME JOB_NAME
YARN_RM_DNS=IP_ADDRESS.internal
Detected region us-east-2
JOB_NAME = JOB_NAME
Specifying us-east-2 while copying script.
Completed 6.6 KiB/6.6 KiB (70.9 KiB/s) with 1 file(s) remaining
download: s3://BUCKET_NAME/admin/JOB_NAME to ./script_2018-10-12-14-57-20.py
SCRIPT_URL = /tmp/g-6cad80fb460992d2c24a6f476b12275d2a9bc164-362894612904031505/script_2018-10-12-14-57-20.py
Upvotes: 4
Views: 6753
Reputation: 3687
If you need to access extra files from within your Glue job you have to:
Copy each file to a location on S3 that Glue has access to.
Include the full S3 key of each file, comma separated, in the --extra-files special parameter of your job.
Glue will then add those files to the --files param given to spark-submit, and you should be able to access them from within your Spark job as if they were in the working directory.
In your example you should be able to simply do:
config = ConfigObj("config.ini")
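For completeness, a sketch of supplying that parameter when starting the job via boto3; the job name and config key are placeholders:
import boto3

glue = boto3.client('glue')
# Special parameters are passed as job arguments; Glue forwards
# --extra-files to spark-submit's --files
glue.start_job_run(
    JobName='JOB_NAME',
    Arguments={
        '--extra-files': 's3://BUCKET_NAME/Python/config.ini',
    },
)
The same key can also be set once in the job's DefaultArguments so every run picks it up without passing it explicitly.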
Upvotes: 6