Reputation: 73
I'm trying to adapt my Spark jobs, which currently run on an on-premise Hadoop cluster, so that they can run both on-premise and on Google Cloud.
I was thinking of having a method that checks whether a given environment variable is defined, to determine if the code is running in the cloud:
import os

def run_on_gcp():
    return os.getenv("ENVIRONMENT_VARIABLE") is not None
I wanted to know: what would be an ENVIRONMENT_VARIABLE that is always defined on Google Cloud and accessible from a Dataproc instance? I was thinking of PROJECT_ID or BUCKET. Which variable do you usually use? How do you usually detect programmatically where your code is running? Thanks
Upvotes: 1
Views: 928
Reputation: 2685
When you submit a job to Dataproc, you can pass your own arguments, such as a profile name or cluster name:
CMD="--job mytestJob \
--job-args path=gs://tests/report\
profile=gcp \
cluster_name=${GCS_CLUSTER}"
gcloud dataproc jobs submit pyspark \
--cluster ${GCS_CLUSTER} \
--py-files ${PY_FILES} \
--async \
${PY_MAIN} \
-- ${CMD}
Then you can detect those args in your program:
import os

# "args" holds the submitted arguments (e.g. as parsed with argparse)
environment = {
    'PYSPARK_JOB_ARGS': ' '.join(args.job_args) if args.job_args else ''
}

# Turn the key=value pairs (e.g. profile=gcp) into a dict
job_args = dict()
if args.job_args:
    job_args_tuples = [arg_str.split('=') for arg_str in args.job_args]
    print('job_args_tuples: %s' % job_args_tuples)
    job_args = {a[0]: a[1] for a in job_args_tuples}

print('\nRunning job %s ...\n environment is %s\n' % (args.job_name, environment))
os.environ.update(environment)
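Building on that, here is a minimal sketch of the check the question asks for, assuming a profile=gcp pair like the one above is always passed at submit time (the helper name run_on_gcp is just illustrative):

def run_on_gcp(job_args):
    # True when the job was submitted with profile=gcp, False on-premise
    return job_args.get('profile') == 'gcp'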
Upvotes: 1
Reputation: 7058
For this purpose you can use DATAPROC_VERSION. If you submit the following PySpark job to Dataproc, it will print out the version you are using (1.3 in my case):
#!/usr/bin/python
import pyspark, os
sc = pyspark.SparkContext()
print(os.getenv("DATAPROC_VERSION"))
Upvotes: 2