rynop

Reputation: 53539

Hive on AWS: convert S3 JSON to Columnar preserving partitions

I have files in S3 that contain many lines of JSON (separated by newline). I want to convert these files to a Columnar Format for consumption by AWS Athena

I am following the Converting to Columnar Formats guide to do this; however, after the conversion, the partition convention in S3 is lost.

In this example, how do you preserve the dt partition in the converted Parquet S3 folder structure? When I run the example, it just outputs s3://myBucket/pq/000000_0 and NOT s3://myBucket/pq/dt=2009-04-14-04-05/000000_0.

Here is the HQL that sets up the interface to bring the JSON into a Hive table:

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string,
  number string,
  processId string,
  browserCookie string,
  requestEndTime string,
  timers struct<modelLookup:string, requestTime:string>,
  threadId string,
  hostname string,
  sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions' ;
msck repair table impressions;
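
After the MSCK REPAIR, a quick sanity check (using the table defined above) confirms that Hive registered the dt partitions:

SHOW PARTITIONS impressions;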

Here is the HQL that converts to Parquet:

CREATE EXTERNAL TABLE  parquet_hive (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';

INSERT OVERWRITE TABLE parquet_hive
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM impressions
WHERE dt='2009-04-14-04-05';

Upvotes: 1

Views: 1208

Answers (2)

David דודו Markovitz

Reputation: 44931

First of all, add PARTITIONED BY (dt string) to the parquet_hive definition.
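
Based on the DDL in your question, the amended definition would look like this:

CREATE EXTERNAL TABLE parquet_hive (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';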

Second, if you want to insert the data partition by partition, you have to declare the partition you are inserting into. Note the PARTITION (dt='2009-04-14-04-05') clause:

INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM   impressions
WHERE  dt='2009-04-14-04-05'
;

An easier way would be to use dynamic partitioning.
Note the PARTITION (dt) and dt as the last column in the SELECT.

You might need to set hive.exec.dynamic.partition.mode:

set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip, dt
FROM   impressions
WHERE  dt='2009-04-14-04-05'
;
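
Either way, once the insert finishes, the new partition should show up both in the metastore and as a dt=2009-04-14-04-05 prefix under s3://mybucket/pq/. A quick check:

SHOW PARTITIONS parquet_hive;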

P.S.
CREATE EXTERNAL TABLE impressions does not "read the JSON into a Hive table".
It is just an interface with the necessary information to read the underlying files (here, in S3).

Upvotes: 3

John Rotenstein

Reputation: 269282

You can simply include the same PARTITIONED BY (dt string) clause that is in your first statement, which will create the same directory structure.

In this case, the dt field (presumably a date) is actually stored in the directory name. A separate directory is created for each value.
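
For example, with two dt values the layout would look like this (the second value is purely illustrative):

s3://mybucket/pq/dt=2009-04-14-04-05/000000_0
s3://mybucket/pq/dt=2009-04-14-05-05/000000_0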

Upvotes: 0
