Reputation: 75
I understand that dataproc workflow-templates is still in beta, but how do you pass parameters via add-job into the SQL file it executes? Here is a basic example:
#!/bin/bash
DATE_PARTITION=$1
echo DatePartition: $DATE_PARTITION
# sample job
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=$DATE_PARTITION"
gcloud beta dataproc workflow-templates run $WORK_FLOW
gcloud beta dataproc workflow-templates remove-job $WORK_FLOW --step-id=0_first-job
echo `date`
Here is my first-job.sql file, which is called from the shell script:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.output.compress=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
USE mydb;
CREATE EXTERNAL TABLE if not exists data_raw (
field1 string,
field2 string
)
PARTITIONED BY (dt String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'gs://data/first-job/';
ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");
In the ALTER TABLE statement, what is the correct syntax? I've tried what feels like over 15 variations, but nothing works. If I hard-code it like this (ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="2017-10-31");), the partition gets created, but unfortunately it needs to be parameterized.
BTW – The error I receive is consistently like this: Error: Error while compiling statement: FAILED: ParseException line 1:48 cannot recognize input near '${DATE_PARTITION}' ')' '' in constant
I am probably close but not sure what I am missing.
TIA, Melissa
Upvotes: 2
Views: 2211
Reputation: 111
Update: Dataproc now has workflow template parameterization, a beta feature:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-parameters
For your specific case, you can do the following. First, create the template and add the Hive job, using a placeholder value for the script variable:
gcloud beta dataproc workflow-templates create my-template
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=PLACEHOLDER"
Then export the template to a YAML file:
gcloud beta dataproc workflow-templates export my-template \
  --destination=hive-template.yaml
Edit hive-template.yaml to add a parameters section that maps a template parameter onto the job's script variable:
jobs:
- hiveJob:
    queryFileUri: gs://mybucket/first-job.sql
    scriptVariables:
      DATE_PARTITION: PLACEHOLDER
  stepId: 0_first-job
parameters:
- name: DATE_PARTITION
  fields:
  - jobs['0_first-job'].hiveJob.scriptVariables['DATE_PARTITION']
Import the edited template, attach a managed cluster, and instantiate it with the real value:
gcloud beta dataproc workflow-templates import my-template \
  --source=hive-template.yaml
gcloud beta dataproc workflow-templates set-managed-cluster my-template \
  --cluster-name=my-cluster \
  --zone=us-central1-a
gcloud beta dataproc workflow-templates instantiate my-template \
  --parameters="DATE_PARTITION=${DATE_PARTITION}"
Upvotes: 3
Reputation: 2158
Thanks for trying out Workflows! First-class support for parameterization is part of our roadmap; however, for now your remove-job/add-job trick is the best way to go.
Regarding your specific question:
1. Values passed via --params are accessed as ${hivevar:PARAM} (see [1]). Alternatively, you can set --properties, which are accessed as ${PARAM}; see the sketch after this list.
2. The brackets around --params are not needed. If the intent is to handle spaces in parameter values, use quotation marks, e.g. --params="FOO=a b c,BAR=X".
3. Finally, I noticed an errant space (DATE_PARTITION =$1), which probably results in an empty DATE_PARTITION value.
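As a minimal illustration of point 1, here is the question's add-job command passing the value both ways (the hard-coded date is just a sample; with --properties, the SQL would reference ${DATE_PARTITION} instead of ${hivevar:DATE_PARTITION}):
# Passed as a script variable; referenced in the SQL as ${hivevar:DATE_PARTITION}:
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=2017-10-31"
# Passed as a Hive property instead; referenced in the SQL as ${DATE_PARTITION}:
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --properties="DATE_PARTITION=2017-10-31"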
Hope this helps!
[1] How to use params/properties flag values when executing hive job on google dataproc
Upvotes: 1