Reputation:
Hi, I am trying to use S3 as the default file system when working with Hive. I have set up:
I can create databases and tables, and they show up in my S3 bucket. The problem occurs when I try to use those tables (selecting from them, inserting into them). I get this error:
Incomplete HDFS URI, no host: hdfs:/tmp/hive/hadoop/dir/filename
The problem is that it still uses HDFS instead of S3. What else should I set up to make Hive and MapReduce use S3 as the file system?
Upvotes: 1
Views: 975
Reputation: 269171
The easiest way to use Hive with Amazon S3 is to launch an Amazon EMR cluster and use External Tables stored on S3.
For example, this statement creates a table that will be stored in S3:
CREATE EXTERNAL TABLE parquet_hive (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
You could then insert data into it:
INSERT OVERWRITE TABLE parquet_hive
SELECT
  requestbegintime,
  adid,
  impressionid,
  referrer,
  useragent,
  usercookie,
  ip
FROM impressions;
See: Converting to Columnar Formats
If you are using your own Hadoop cluster instead of Amazon EMR, you might need some additional configuration to work with S3 (e.g. using s3n: or s3a: URIs), as sketched below.
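For a self-managed cluster, a minimal sketch of that extra configuration might look like the following. It assumes the s3a connector (the hadoop-aws and AWS SDK jars) is on Hive's classpath; the bucket name, credentials, and the table name parquet_hive_s3a are placeholders for illustration, and depending on your setup these properties may need to live in core-site.xml rather than per-session SET commands:
-- Session-level credentials for the s3a connector (placeholder values)
SET fs.s3a.access.key=YOUR_ACCESS_KEY;
SET fs.s3a.secret.key=YOUR_SECRET_KEY;

-- Same table layout as above, but pointing at S3 through the s3a scheme
CREATE EXTERNAL TABLE parquet_hive_s3a (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
STORED AS PARQUET
LOCATION 's3a://myBucket/myParquet/';
Of the two schemes, s3a is the newer, actively maintained connector in Apache Hadoop, so prefer it over s3n where your Hadoop version supports it.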
Upvotes: 1