user4038636

Using S3 as default file system

Hi, I am trying to use S3 as the default file system when working with Hive, and I have set it up. I can create databases and tables, and they show up in my S3 bucket. The problem occurs when I try to actually use those tables (selecting from them, inserting into them). I get an error:

Incomplete HDFS URI, no host: hdfs:/tmp/hive/hadoop/dir/filename

The problem is that it still uses HDFS instead of S3. What else should I set up to make Hive and MapReduce use S3 as the file system?

Upvotes: 1

Views: 975

Answers (1)

John Rotenstein

Reputation: 269171

The easiest way to use Hive with Amazon S3 is to launch an Amazon EMR cluster and use external tables stored in S3.

For example, this statement creates a table that will be stored in S3:

CREATE EXTERNAL TABLE parquet_hive (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string
)
STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';

You could then insert data into it:

INSERT OVERWRITE TABLE parquet_hive
SELECT
  requestbegintime,
  adid,
  impressionid,
  referrer,
  useragent,
  usercookie,
  ip
FROM impressions;
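
Querying the table then reads the Parquet data directly from S3. For example (just a sketch of a simple aggregation over the columns defined above):

-- Count impressions per referrer, reading straight from the S3-backed table
SELECT referrer, COUNT(*) AS impression_count
FROM parquet_hive
GROUP BY referrer;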

See: Converting to Columnar Formats in the Amazon EMR documentation.

If you are using your own Hadoop cluster instead of Amazon EMR, you might need some additional configuration to work with S3 (eg using the s3n: or s3a: URI schemes).
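
For example, with the S3A connector you could supply credentials and point the table location at an s3a:// URI. This is only a sketch: the fs.s3a.* properties are the standard Hadoop S3A settings (normally placed in core-site.xml or provided via an IAM role rather than set in the session), and the bucket, keys, and table name below are placeholders.

-- Session-level S3A credentials (placeholders; usually configured in core-site.xml instead)
SET fs.s3a.access.key=YOUR_ACCESS_KEY;
SET fs.s3a.secret.key=YOUR_SECRET_KEY;

-- Same external-table pattern as above, using the s3a: scheme
CREATE EXTERNAL TABLE parquet_hive_s3a (
    requestBeginTime string,
    adId string,
    impressionId string
)
STORED AS PARQUET
LOCATION 's3a://myBucket/myParquet/';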

Upvotes: 1
