Reputation: 2677
I used this link to successfully query Classic Load Balancer logs, but as the bucket has grown in size I want to partition my table. Unfortunately I can't get it to work and would appreciate some advice.
The following attempt does create a partitioned table, but querying it returns 0 rows:
CREATE EXTERNAL TABLE `elb_logs_part`(
`timestamp` string COMMENT '',
`elb_name` string COMMENT '',
`request_ip` string COMMENT '',
`request_port` int COMMENT '',
`backend_ip` string COMMENT '',
`backend_port` int COMMENT '',
`request_processing_time` double COMMENT '',
`backend_processing_time` double COMMENT '',
`client_response_time` double COMMENT '',
`elb_response_code` string COMMENT '',
`backend_response_code` string COMMENT '',
`received_bytes` bigint COMMENT '',
`sent_bytes` bigint COMMENT '',
`request_verb` string COMMENT '',
`url` string COMMENT '',
`protocol` string COMMENT '',
`user_agent` string COMMENT '',
`ssl_cipher` string COMMENT '',
`ssl_protocol` string COMMENT '')
PARTITIONED BY(year string, month string, day string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/AWSLogs/XXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
'transient_lastDdlTime'='1555268331')
I have tried adding the region to the end of the LOCATION (this seems logical, but the above was adapted from the CREATE query Athena produced, so I went with it).
I have also tried to alter the existing table using:
ALTER TABLE elb_logs.elb_logs ADD
PARTITION(year='2019' month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
PARTITION(year='2019' month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
PARTITION(year='2019' month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
Unfortunately this produces the error:
line 2:4: missing 'column' at 'partition'
which I don't understand, as the statement above is taken directly from the documentation. I imagine it is caused by the partitions not being defined, or something like that!?
Sorry, this is all new to me. Can anyone help me partition Classic Load Balancer logs stored in S3 using Athena?
I really need to spot the buggers who keep scraping my sites and, I think, unintentionally taking us offline!
Upvotes: 0
Views: 414
Reputation: 2677
To partition the data I had to take two steps: first create a new partitioned table (even though its top-level location is not quite right), then add the correct partitions. This works because Athena reads each partition's own LOCATION, so the table-level LOCATION is effectively just a default.
Step 1: Create the partitioned table (note the location is not ideal, as it excludes the region, but per the above this turns out not to matter):
CREATE EXTERNAL TABLE `elb_logs_part`(
`timestamp` string COMMENT '',
`elb_name` string COMMENT '',
`request_ip` string COMMENT '',
`request_port` int COMMENT '',
`backend_ip` string COMMENT '',
`backend_port` int COMMENT '',
`request_processing_time` double COMMENT '',
`backend_processing_time` double COMMENT '',
`client_response_time` double COMMENT '',
`elb_response_code` string COMMENT '',
`backend_response_code` string COMMENT '',
`received_bytes` bigint COMMENT '',
`sent_bytes` bigint COMMENT '',
`request_verb` string COMMENT '',
`url` string COMMENT '',
`protocol` string COMMENT '',
`user_agent` string COMMENT '',
`ssl_cipher` string COMMENT '',
`ssl_protocol` string COMMENT '')
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
'transient_lastDdlTime'='1555268331')
Step 2: Add the partitions as below (thanks to Ilya for spotting the missing comma):
ALTER TABLE elb_logs.elb_logs_part ADD
PARTITION(year='2019', month = '11', day = '18') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/18'
PARTITION(year='2019', month = '11', day = '17') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/17'
PARTITION(year='2019', month = '11', day = '16') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/16'
This worked for me.
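With the partitions in place, a query that filters on year, month and day only scans that day's objects, which keeps scan costs down as the bucket grows. As a rough sketch of the kind of query I was after (the aggregation is just an illustration; the table and columns are the ones defined above), something like this should surface the noisiest client IPs for a single day:
SELECT request_ip, count(*) AS requests
FROM elb_logs.elb_logs_part
WHERE year = '2019' AND month = '11' AND day = '18'  -- prunes the scan to one partition
GROUP BY request_ip
ORDER BY requests DESC
LIMIT 20;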
Upvotes: 0
Reputation: 2668
You are missing a comma (,) after you specify year. The following statement resulted in "Query successful.":
ALTER TABLE elb_logs.elb_logs ADD
PARTITION(year='2019', month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/18'
PARTITION(year='2019', month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
PARTITION(year='2019', month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/16'
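As a quick sanity check, you can then list the partitions the table knows about (SHOW PARTITIONS is standard Athena/Hive syntax):
SHOW PARTITIONS elb_logs.elb_logs;
Each partition added above should appear in the output as year=2019/month=11/day=18 and so on.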
Upvotes: 1