Reputation: 53539
I have files in S3 that contain many lines of newline-delimited JSON. I want to convert these files to a columnar format for consumption by AWS Athena.
I am following the Converting to Columnar Formats guide to do this; however, when the data is converted, the partition convention in S3 is lost.
In this example, how do you preserve the dt
partition in the converted Parquet S3 folder structure? When I run the example it just outputs s3://myBucket/pq/000000_0
and NOT s3://myBucket/pq/dt=2009-04-14-04-05/000000_0
Here is the HQL that sets up the interface to bring the JSON into a Hive table:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions' ;
msck repair table impressions;
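The msck repair table statement registers the existing dt=... directories as partitions; a quick check that they were picked up:
show partitions impressions;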
Here is the HQL that converts to Parquet:
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';
INSERT OVERWRITE TABLE parquet_hive SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip FROM impressions where dt='2009-04-14-04-05';
Upvotes: 1
Views: 1208
Reputation: 44931
First of all, add PARTITIONED BY (dt string)
to the parquet_hive
definition.
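A minimal sketch of the revised definition, reusing the column list and location from the question (dt goes in PARTITIONED BY, not in the column list):
CREATE EXTERNAL TABLE parquet_hive (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';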
Second, if you want to insert the data partition by partition, you have to declare the partition you are inserting into.
Note the PARTITION (dt='2009-04-14-04-05'):
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip
FROM impressions where dt='2009-04-14-04-05'
;
An easier way would be to use dynamic partitioning.
Note the PARTITION (dt)
and the dt
as the last column in the SELECT.
You might need to set hive.exec.dynamic.partition.mode:
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip,dt
FROM impressions where dt='2009-04-14-04-05'
;
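On older Hive versions you may also need set hive.exec.dynamic.partition=true;. Once this works for one partition, dropping the WHERE clause converts every partition of impressions in a single pass (a sketch; with dynamic partitions, INSERT OVERWRITE only replaces the partitions that appear in the query output):
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip,dt
FROM impressions
;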
P.S.
CREATE EXTERNAL TABLE impressions
does not "read the JSON into a Hive table".
It is just an interface with the necessary information to read the underlying files.
Upvotes: 3
Reputation: 269282
You can simply include the same PARTITIONED BY (dt string)
parameter that is in your first statement, which will create the same directory structure.
In this case, the dt
field (presumably a date) is actually stored in the directory name. A separate directory is created for each value.
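For example, after the partitioned insert, each distinct dt value lands under its own prefix (the second value here is hypothetical):
s3://mybucket/pq/dt=2009-04-14-04-05/000000_0
s3://mybucket/pq/dt=2009-04-14-04-06/000000_0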
Upvotes: 0