Melissa

Reputation: 75

passing parameters via dataproc workflow-templates

I understand that dataproc workflow-templates is still in beta, but how do you pass parameters via the add-job into the executable sql? Here is a basic example:

#!/bin/bash

DATE_PARTITION=$1

echo DatePartition: $DATE_PARTITION


# sample job
gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=$DATE_PARTITION"


gcloud beta dataproc workflow-templates run $WORK_FLOW 

gcloud beta dataproc workflow-templates remove-job $WORK_FLOW --step-id=0_first-job

echo `date`

Here is my first-job.sql file called from the shell:

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.output.compress=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

USE mydb;

CREATE EXTERNAL TABLE if not exists data_raw (
      field1 string,
      field2 string
    )
    PARTITIONED BY (dt String)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'gs://data/first-job/';

ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="${hivevar:DATE_PARTITION}");

In the ALTER TABLE statement, what is the correct syntax? I’ve tried what feels like over 15 variations, but nothing works. If I hard-code it like this (ALTER TABLE data_raw ADD IF NOT EXISTS PARTITION(dt="2017-10-31");), the partition gets created, but unfortunately it needs to be parameterized.

BTW – The error I receive is consistently like this: Error: Error while compiling statement: FAILED: ParseException line 1:48 cannot recognize input near '${DATE_PARTITION}' ')' '' in constant

I am probably close but not sure what I am missing.

TIA, Melissa

Upvotes: 2

Views: 2211

Answers (2)

Mikayla Konst

Reputation: 111

Update: Dataproc now has workflow template parameterization, a beta feature:

https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-parameters

For your specific case, you can do the following:

Create an empty template

gcloud beta dataproc workflow-templates create my-template

Add a job with a placeholder for the value you want to parameterize

gcloud beta dataproc workflow-templates add-job hive \
--step-id=0_first-job \
--workflow-template=my-template \
--file='gs://mybucket/first-job.sql' \
--params="DATE_PARTITION=PLACEHOLDER"

Export the template configuration to a file

gcloud beta dataproc workflow-templates export my-template \
--destination=hive-template.yaml

Edit the file to add a parameter

jobs:
- hiveJob:
    queryFileUri: gs://mybucket/first-job.sql
    scriptVariables:
      DATE_PARTITION: PLACEHOLDER
  stepId: 0_first-job
parameters:
- name: DATE_PARTITION
  fields:
  - jobs['0_first-job'].hiveJob.scriptVariables['DATE_PARTITION']

Import the changes

gcloud beta dataproc workflow-templates import my-template \
--source=hive-template.yaml

Add a managed cluster or cluster selector

gcloud beta dataproc workflow-templates set-managed-cluster my-template \
--cluster-name=my-cluster \
--zone=us-central1-a

Run your template with parameters

gcloud beta dataproc workflow-templates instantiate my-template \
--parameters="DATE_PARTITION=${DATE_PARTITION}"

Upvotes: 3

tix

Reputation: 2158

Thanks for trying out Workflows! First-class support for parameterization is part of our roadmap. However, for now, your remove-job/add-job trick is the best way to go.

Regarding your specific question:

  • Values passed via --params are accessed as ${hivevar:PARAM} (see [1]). Alternatively, you can set --properties, which are accessed as ${PARAM} (see the sketch after this list).

  • The brackets around params are not needed. If the intent is to handle spaces in parameter values, use quotation marks, e.g.: --params="FOO=a b c,BAR=X"

  • Finally, I noticed an errant space here: DATE_PARTITION =$1, which probably results in an empty DATE_PARTITION value
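
To illustrate the difference between the two flags, here is a rough sketch, reusing the template and bucket names from your question (the second step id is just made up for illustration):

#!/bin/bash
# Sketch: the same hive step added two ways.

# 1. --params: referenced inside first-job.sql as ${hivevar:DATE_PARTITION}
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --params="DATE_PARTITION=2017-10-31"

# 2. --properties: referenced inside first-job.sql as ${DATE_PARTITION}
gcloud beta dataproc workflow-templates add-job hive \
  --step-id=0_first-job-props \
  --workflow-template=my-template \
  --file='gs://mybucket/first-job.sql' \
  --properties="DATE_PARTITION=2017-10-31"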

Hope this helps!

[1] How to use params/properties flag values when executing hive job on google dataproc

Upvotes: 1
