Melissa

Reputation: 75

passing parameters via dataproc workflow-templates

I understand that dataproc workflow-templates is still in beta, but how do you pass parameters via the add-job into the executable sql? Here is a basic example:

#!/bin/bash

DATE_PARTITION=$1

echo DatePartition: $DATE_PARTITION


# sample job
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=$DATE_PARTITION"


gcloud beta dataproc workflow-templates run $WORK_FLOW 

gcloud beta dataproc workflow-templates remove-job $WORK_FLOW --step-id=0_first-job

echo `date`

Here is my first-job.sql file called from the shell:

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.output.compress=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

USE mydb;

CREATE EXTERNAL TABLE if not exists data_raw (
      field1 string,
      field2 string
    )
    PARTITIONED BY (dt String)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'gs://data/first-job/';

ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");

In the ALTER TABLE statement, what is the correct syntax? I’ve tried what feels like over 15 variations, but nothing works. If I hard-code it like this (ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="2017-10-31");), the partition gets created, but unfortunately it needs to be parameterized.

BTW – The error I receive is consistently like this: Error: Error while compiling statement: FAILED: ParseException line 1:48 cannot recognize input near '${DATE_PARTITION}' ')' '' in constant

I am probably close but not sure what I am missing.

TIA, Melissa

Upvotes: 2

Views: 2211

Answers (2)

Mikayla Konst

Reputation: 111

Update: Dataproc now has workflow template parameterization, a beta feature:

https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-parameters

For your specific case, you can do the following:

Create an empty template

gcloud beta dataproc workflow-templates create my-template

Add a job with a placeholder for the value you want to parameterize

gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=PLACEHOLDER"

Export the template configuration to a file

gcloud beta dataproc workflow-templates export my-template \
--destination=hive-template.yaml

Edit the file to add a parameter

jobs:
- hiveJob:
    queryFileUri: gs://mybucket/first-job.sql
    scriptVariables:
      DATE_PARTITION: PLACEHOLDER
  stepId: 0_first-job
parameters:
- name: DATE_PARTITION
  fields:
  - jobs['0_first-job'].hiveJob.scriptVariables['DATE_PARTITION']

Import the changes

gcloud beta dataproc workflow-templates import my-template \
--source=hive-template.yaml

Add a managed cluster or cluster selector

gcloud beta dataproc workflow-templates set-managed-cluster my-template \
--cluster-name=my-cluster \
--zone=us-central1-a

Run your template with parameters

gcloud beta dataproc workflow-templates instantiate my-template \
--parameters="DATE_PARTITION=${DATE_PARTITION}"

Upvotes: 3

tix

Reputation: 2158

Thanks for trying out Workflows! First-class support for parameterization is part of our roadmap. However, for now, your remove-job/add-job trick is the best way to go.

Regarding your specific question:

  • Values passed via --params are accessed as ${hivevar:PARAM} (see [1]). Alternatively, you can set --properties, which are accessed as ${PARAM} (see the sketch after this list).

  • The brackets around params are not needed. If the intent is to handle spaces in parameter values, use quotation marks, e.g.: --params="FOO=a b c,BAR=X"

  • Finally, I noticed an errant space here: DATE_PARTITION =$1, which probably results in an empty DATE_PARTITION value
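
To illustrate the difference between the two flags, here is a rough sketch, reusing the template and bucket names from your question (the second step id is just made up for illustration):

#!/bin/bash
# Sketch: the same hive step added two ways.

# 1. --params: referenced inside first-job.sql as ${hivevar:DATE_PARTITION}
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=2017-10-31"

# 2. --properties: referenced inside first-job.sql as ${DATE_PARTITION}
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job-props \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --properties="DATE_PARTITION=2017-10-31"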

Hope this helps!

[1] How to use params/properties flag values when executing hive job on google dataproc

Upvotes: 1
