data princess

Reputation: 1158

Using bash to send hive script a variable number of fields

I'm automating a data pipeline with a bash script that moves CSVs into HDFS and builds external Hive tables on top of them. Currently this only works when the table's schema is predefined in an .hql file, but I want to read the headers from each CSV and pass them to Hive as arguments. Right now, inside a loop over the files, I run:

# bash
hive -S -hiveconf VAR1="$target_db" -hiveconf VAR2="$filename" -hiveconf VAR3="$target_folder/$filename" -f create_tables.hql

...which is sent to this .hql file:

-- hive
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
  individual_pkey INT,
  response CHAR(1)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';

I want the Hive script to look more like this:

CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
  ${hiveconf:ROW1} ${hiveconf:TYPE1},
  ...              ...
  ${hiveconf:ROW_N} ${hiveconf:TYPE_N}
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';

Is it possible to send it some kind of array that it would parse? Is this feasible or advisable?

Upvotes: 0

Views: 445

Answers (1)

data princess

Reputation: 1158

I eventually figured out a way around this.

You can't really write an HQL script that takes a variable number of fields. You can, however, write a bash script that generates an HQL script of variable length. I've implemented this for my team; the general idea is to write out how you want the HQL to look as a string in bash, then use something like Rscript to read your CSV and infer the data types of its columns. Store the detected types in an array alongside the CSV headers, then loop through both arrays, writing one column definition per iteration into the HQL file.
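A minimal sketch of that generate-then-run approach, assuming the type-detection step has already produced an array of Hive types (the `types` array, file names, and database names below are illustrative, not from the original setup):

```shell
#!/usr/bin/env bash
# Build a sample CSV so the script is self-contained.
printf 'individual_pkey,response\n1,Y\n' > sample.csv

target_db="my_db"        # assumed values for illustration
filename="my_table"
target_folder="data"
csv="sample.csv"

# Read the header row into an array of column names.
IFS=',' read -r -a headers < <(head -n 1 "$csv")

# Stand-in for the detected types (e.g. output of an Rscript type-sniffer).
types=("INT" "CHAR(1)")

# Emit an HQL file whose column list matches the header length.
hql="create_${filename}.hql"
{
  echo "CREATE DATABASE IF NOT EXISTS ${target_db};"
  echo "CREATE EXTERNAL TABLE IF NOT EXISTS ${target_db}.${filename}("
  last=$(( ${#headers[@]} - 1 ))
  for i in "${!headers[@]}"; do
    sep=","
    [ "$i" -eq "$last" ] && sep=""   # no trailing comma on the last column
    echo "  ${headers[$i]} ${types[$i]}${sep}"
  done
  echo ")"
  echo "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
  echo "LOCATION '/${target_folder}/${filename}';"
} > "$hql"

# Then run it the same way as before:
# hive -S -f "$hql"
```

Because the HQL is generated per file, no hiveconf gymnastics are needed: each table gets exactly as many column definitions as its CSV has headers.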

Upvotes: 1

Related Questions