Diego Serrano

Pass Hive parameters to EMR Step

I am trying to use EMR to run a query on an EXTERNAL table partitioned by date, where the dt partition has the format YYYYmmdd, e.g. 20190121.

CREATE EXTERNAL TABLE `my_schema`.`tracking_table`(
  `id` string,
  `active_bitmap` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'quoteChar'='\"',
  'separatorChar'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket/trackingtable'

I created a simple script that inserts the results, separated by tabs and compressed in gzip, into my S3 bucket.

set hive.cli.print.header=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE DIRECTORY '${OUTPUT}/dt={:start_date}/'
select if(b.id is null,a.id,b.id) as id
       ,if(b.days_active is null, 1, (shiftleft(CAST(b.days_active AS BIGINT),1))|if(a.is_active is null,0,1) ) as active_bitmap
       ,'{:start_date}' as dt_partition
from(
    select id,
    if(count(1) > 0, 1, NULL) as is_active
    from my_schema.activity_table where dt='{:start_date}' group by id
)a
full outer join(
    select * from my_schema.tracking_table where dt='{:start_date-1}'
)b on a.id=b.id;

I tested my script in the Hive console by replacing the ${OUTPUT}, {:start_date} and {:start_date-1} parameters with literal values, and it works fine: I can see the results, compressed and tab-separated, in my S3 output bucket.

Now I want to run this script programmatically for the last year of data. How can I pass the date parameter to my EMR Step? I see there's an arguments section on EMR, but I am guessing that's for configuration parameters of the EMR cluster.

Also, will {:start_date-1} work for my date format, or do I need to parse the string as a date, subtract a day and format it as a string again?

I was planning on creating a Python script that takes a range of dates and submits one step per date to a long-running EMR cluster, but I don't know how to pass the date as a parameter and I can't find any tutorials on how to do this easily.

Answers (1)

Diego Serrano

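There are two ways to pass parameters to an EMR Hive step. In both cases each value is passed as a -d key=value pair and referenced in the script as ${key}, the same way the script above uses ${OUTPUT}.

In the EMR web console, add the parameters to the step's Arguments field:

-d parameter1=value1 -d parameter2=value2

For example:

-d dt=20190101 -d dt2=20190201

With the AWS CLI, add the same parameters to the Args list of the step, e.g.: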

aws emr add-steps --cluster-id j-xxxxx --steps Type=HIVE,Name='Hive Job Name XXX',ActionOnFailure=CONTINUE,Args=[-f,s3://bucket/folder/,-d,INPUT=s3://bucket/folder/input/,-d,OUTPUT=s3://bucket/folder/output/,-d,dt=20190101,-d,dt2=20190201]
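
Since the plan was to drive this from a Python script, the same command can be assembled as an argument list and handed to the AWS CLI. This is only a sketch, assuming the CLI is installed and configured on the machine that runs it. The function names and the step name are made up for illustration, and the script path is the same placeholder as in the command above.

```python
import subprocess


def build_add_steps_command(cluster_id, script, params):
    """Build the argv for one `aws emr add-steps` call that runs a Hive script.

    `params` maps Hive variable names to values; each entry becomes a
    `-d,key=value` pair inside Args. All names here are illustrative.
    """
    args = ["-f", script]
    for key, value in params.items():
        args += ["-d", f"{key}={value}"]
    step = (
        "Type=HIVE,"
        f"Name=hive-step-{params.get('dt', 'nodate')},"  # hypothetical step name
        "ActionOnFailure=CONTINUE,"
        f"Args=[{','.join(args)}]"
    )
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


def submit(cmd):
    """Run the command; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(cmd, check=True)


cmd = build_add_steps_command(
    "j-xxxxx",
    "s3://bucket/folder/",  # placeholder script location, as in the CLI example
    {
        "INPUT": "s3://bucket/folder/input/",
        "OUTPUT": "s3://bucket/folder/output/",
        "dt": "20190101",
        "dt2": "20190201",
    },
)
print(" ".join(cmd))
# submit(cmd)  # uncomment to actually submit the step
```

Passing the command as a list avoids shell quoting issues, which is why the step name here contains no spaces.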

More:

https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
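
On the second part of the question: a -d value is substituted into the script as plain text, so an expression like {:start_date-1} is not evaluated. Treating the date as a number would not help either, since 20190201 minus 1 is 20190200, which is not a date. One option is to compute both dates in the submitting script and pass them as two separate parameters. A rough sketch using only the standard library, assuming dt2 carries the previous day's partition; the function names are hypothetical:

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d"  # matches the dt partition format, e.g. 20190121


def previous_day(dt):
    """Return the day before `dt`; input and output are YYYYmmdd strings."""
    return (datetime.strptime(dt, FMT) - timedelta(days=1)).strftime(FMT)


def date_pairs(start, end):
    """Yield (dt, dt2) for each day from start to end inclusive.

    dt2 is the day before dt, so it can be passed as a second -d parameter.
    """
    day = datetime.strptime(start, FMT)
    last = datetime.strptime(end, FMT)
    while day <= last:
        dt = day.strftime(FMT)
        yield dt, previous_day(dt)
        day += timedelta(days=1)


for dt, dt2 in date_pairs("20190130", "20190201"):
    print(f"-d dt={dt} -d dt2={dt2}")
```

Each pair would then go into one step's arguments, with the script reading ${dt} and ${dt2} in place of the two date placeholders.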
