Reputation: 14216
I am having trouble getting my Hive output to save to S3. I have tried SSHing into the master node and running my command in Hive, but it does not save the output. I have also tried running the command in Hue from the EMR console in AWS, and it still does not save to S3. I also added the script as a step, and it still doesn't save. The only way I have been able to get the results is to run the query in Hue, click to see the results, download them, and then push them to S3 myself. I am clueless as to why this is happening. Here is the query I am running:
with temp as (
select /*+ streamtable(l) */ a.id, a.name, a.page
from my_table a
join my_other_table l on (a.id = l.id)
group by a.page, a.id, a.name)
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7/'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;
As a note, I would prefer a solution that works when adding a step, since I plan to add x number of steps sequentially.
Upvotes: 1
Views: 2058
Reputation: 14216
I found the solution.
The problem was the trailing slash in the S3 location: the base path of the directory you wish to overwrite should NOT contain a trailing slash.
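For example, the INSERT from the question works once the final slash is dropped (the path is otherwise unchanged):
insert overwrite directory 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7'
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;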
Upvotes: 1
Reputation: 269282
The normal way I've seen Amazon EMR output to Amazon S3 is to CREATE EXTERNAL TABLE with a LOCATION in Amazon S3.
For example:
CREATE EXTERNAL TABLE IF NOT EXISTS output_table
(gram string, year int, ratio double, increase double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/directory';
Then, just INSERT data into that table:
INSERT OVERWRITE TABLE output_table
SELECT gram FROM table...
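Applied to the query in the question, a minimal sketch could look like the following (output_table and its column names are just placeholders, and the LOCATION reuses the S3 path from the question without a trailing slash):
-- hypothetical external table over the desired S3 output path
CREATE EXTERNAL TABLE IF NOT EXISTS output_table (page string, id_count bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket/folder/folder2/folder3/folder4/folder5/folder6/folder7';

-- original query from the question, redirected into the external table instead of a directory
with temp as (
select /*+ streamtable(l) */ a.id, a.name, a.page
from my_table a
join my_other_table l on (a.id = l.id)
group by a.page, a.id, a.name)
insert overwrite table output_table
select page, count(distinct id) over (PARTITION BY page)
from temp
group by page;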
Upvotes: 1