gold_cy

Reputation: 14216

Hive results not being saved into S3 bucket

I am having trouble getting my Hive output to save to S3. I have tried SSHing into the master node and running my query in Hive, but it does not save the output. I have also tried running the query in Hue from the EMR console in AWS, and still it does not save to S3. I also added the script as a step, and it still doesn't save. The only way I have been able to get the results is to run the query in Hue, click to see the results, download them, and then push them to S3 myself. I am clueless as to why this is happening. Here is the query I am running.

with temp as (
select /*+ streamtable(l) */ a.id, a.name, a.page
from my_table a
join my_other_table l on (a.id = l.id)
group by a.page, a.id, a.name)
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7/'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;

As a note, I would prefer for the solution to work when adding a step since I plan to add x number of steps sequentially.
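For reference, submitting a Hive script as an EMR step can be done with the AWS CLI's `aws emr add-steps` command. This is just an illustrative sketch; the cluster ID, bucket, and script name below are placeholders you would substitute with your own:

```shell
# Submit a Hive script stored in S3 as an EMR step.
# j-XXXXXXXXXXXXX and the s3:// paths are hypothetical placeholders.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=HIVE,Name="Run Hive query",ActionOnFailure=CONTINUE,Args=[-f,s3://my-bucket/scripts/query.q]
```

Repeating `--steps` entries (or passing several comma-separated step definitions) lets you queue multiple scripts to run sequentially.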

Upvotes: 1

Views: 2058

Answers (2)

gold_cy

Reputation: 14216

I found the solution.

The problem was the trailing slash in the S3 location: the base path of the directory you wish to overwrite should NOT contain a trailing slash.
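Applied to the query in the question, the target line would be written without the trailing slash (the bucket and folder names are the placeholders from the question):

```sql
-- Corrected target: no trailing slash on the directory path
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;
```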

Upvotes: 1

John Rotenstein

Reputation: 269282

The normal way I've seen Amazon EMR output to Amazon S3 is to CREATE EXTERNAL TABLE with a LOCATION in Amazon S3.

For example:

CREATE EXTERNAL TABLE IF NOT EXISTS output_table
(gram string, year int, ratio double, increase double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/directory';

Then, just INSERT data into that table:

INSERT OVERWRITE TABLE output_table
SELECT gram FROM table...
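Adapted to the question's query, the same pattern might look like the sketch below. The table name `page_counts` and the column types are assumptions; the S3 path is the one from the question, written without a trailing slash:

```sql
-- Hypothetical external table over the desired S3 output location
CREATE EXTERNAL TABLE IF NOT EXISTS page_counts
(page string, id_count bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7';

-- Writing into the table overwrites the files under that S3 location
INSERT OVERWRITE TABLE page_counts
SELECT page, count(distinct id)
FROM temp
GROUP BY page;
```

Because the table is EXTERNAL, dropping it later leaves the data files in S3 untouched.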

Upvotes: 1
