hatim009

Reputation: 25

Load data into a Hive table stored in S3 whose location is kept in another, static S3 location

I need to load S3 data into a Hive table. The source S3 location is dynamic and is recorded in a file at another, static S3 location.

The dynamic S3 location I want to load into the Hive table has the path format

s3://s3BucketName/some-path/yyyy-MM-dd

and the file at the static location contains records in the format

{"datasetDate": "datePublished", "s3Location": "s3://s3BucketName/some-path/yyyy-MM-dd"}

Is there a way to read this data in Hive? I have searched a lot but could not find anything.

Upvotes: 1

Views: 1070

Answers (1)

leftjoin

Reputation: 38335

You can read the JSON from the file in the static location, parse the s3Location field out of it, and pass it as a parameter to an ADD PARTITION clause.

One possible way to read the JSON is with Hive itself; any other tool that can fetch and parse the file would work just as well.

Example using Hive:

CREATE TABLE data_location (location_info STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://s3BucketName/some-static-location-path/';

Then read the location in a shell script and pass it as a parameter to the ADD PARTITION statement.

For example, suppose you have a table named target_table partitioned by datePublished. You can add a partition like this:

#!/bin/bash

# -S (silent mode) suppresses Hive log output so only the query result is captured
data_location=$(hive -S -e "set hive.cli.print.header=false; select get_json_object(location_info,'$.s3Location') from data_location")

# The partition value is the last component of the S3 path (yyyy-MM-dd)
partition=$(basename "${data_location}")

# Create the partition in the target table, pointing it at the dynamic location
hive -e "ALTER TABLE target_table ADD IF NOT EXISTS PARTITION (datePublished='${partition}') LOCATION '${data_location}'"
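If you prefer not to start a Hive session just to read one line of JSON, the same extraction can be done in plain bash. A minimal sketch, assuming the static file holds a single JSON line and using a hypothetical date:

```shell
#!/bin/bash
# Hypothetical record, as it would appear in the static S3 file
# (in practice you would fetch it first, e.g. with: aws s3 cp <static-path> -)
line='{"datasetDate": "2020-01-15", "s3Location": "s3://s3BucketName/some-path/2020-01-15"}'

# Pull the s3Location field out with a bash regex (jq would also work)
if [[ $line =~ \"s3Location\":\ *\"([^\"]+)\" ]]; then
    data_location="${BASH_REMATCH[1]}"
fi

# The partition value is the last path component
partition=$(basename "${data_location}")
echo "${partition}"
```

The extracted values can then be passed to the same ADD PARTITION statement as above.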

If you do not want a partitioned table, you can use ALTER TABLE ... SET LOCATION instead of adding a partition:

hive -e "ALTER TABLE target_table SET LOCATION '${data_location}'"

If only the last subfolder name (the date) is dynamic and the base directory is always the same, like s3://s3BucketName/some-path/ with only yyyy-MM-dd changing, you can create the table once with location s3://s3BucketName/some-path/ and schedule a partition-discovery statement (ALTER TABLE ... RECOVER PARTITIONS on Amazon EMR Hive, MSCK REPAIR TABLE in vanilla Hive) to attach new partitions on a daily basis. In this case you do not need to read the file with the location at all. Note, however, that automatic discovery only finds directories that follow Hive's key=value naming convention, so plain yyyy-MM-dd folders still require the explicit ADD PARTITION approach above.
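For the automatic-discovery route to work, the daily folders have to follow Hive's key=value partition naming. A sketch of the layout this implies, using a hypothetical publish date:

```shell
#!/bin/bash
# MSCK REPAIR TABLE / RECOVER PARTITIONS only discover directories named
# in Hive's key=value convention, so the daily upload path would need to be
# shaped like this rather than a bare yyyy-MM-dd folder:
base="s3://s3BucketName/some-path"
dt="2020-01-15"   # hypothetical publish date
discoverable="${base}/datepublished=${dt}/"
echo "${discoverable}"
```

With that layout in place, a daily scheduled repair/recover run picks up the new partition without any scripting around the location file.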

Upvotes: 1
