Reputation: 25
I need to load an s3 data in hive table. This s3 location is dynamic and is stored in another static s3 location.
The dynamic s3 location which I want to load in hive table has path format
s3://s3BucketName/some-path/yyyy-MM-dd
and the static location has data format
{"datasetDate": "datePublished", "s3Location": "s3://s3BucketName/some-path/yyyy-MM-dd"}
Is there a way to read this data in hive? I searched about this a lot but could not find anything.
Upvotes: 1
Views: 1070
Reputation: 38335
You can read JSON data from your static location file, parse s3Location
field and pass it as a parameter to your add partition clause.
Possible way to read json is using Hive. You can use some other means for the same.
Example using Hive.
create table data_location(location_info string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://s3BucketName/some-static-location-path/';
Then get the location in the shell script and pass it as a parameter to ADD partition statement.
For example you have table named target_table
partitioned by datePublished
. You can add partitions like this:
#!/bin/bash
data_location=$(hive -e "set hive.cli.print.header=false; select get_json_object(location_info,'$.s3Location') from data_location")
#get partition name
partition=$(basename ${data_location})
#Create partition in your target table:
hive -e "ALTER TABLE TARGET_TABLE ADD IF NOT EXISTS PARTITION (datePublished='${partition}') LOCATION '${data_location}'"
If you do not want partitioned table, then you can use ALTER TABLE SET LOCATION instead of adding partition:
hive -e "ALTER TABLE TARGET_TABLE SET LOCATION='${data_location}'"
If it is only the last subfolder name is dynamic (which is date) and base directory is always the same, like s3://s3BucketName/some-path/, only yyyy-MM-dd is changing, you can create table once with location s3://s3BucketName/some-path/
and issue RECOVER PARTITIONS statement. In this case you do not need to read the content of file with location specification. Just schedule RECOVER PARTITIONS
to get new partition attached on daily basis.
Upvotes: 1