n34_panda

Reputation: 2677

Athena classic load balancer logs

I used this link to successfully query classic load balancer logs; however, as the bucket has grown, I want to partition my table. Unfortunately I can't get it to work and would appreciate some advice.

The following is an attempt to create a partitioned table; the table is created, but queries against it return 0 rows:

CREATE EXTERNAL TABLE `elb_logs_part`( 
  `timestamp` string COMMENT '', 
  `elb_name` string COMMENT '', 
  `request_ip` string COMMENT '', 
  `request_port` int COMMENT '', 
  `backend_ip` string COMMENT '', 
  `backend_port` int COMMENT '', 
  `request_processing_time` double COMMENT '', 
  `backend_processing_time` double COMMENT '', 
  `client_response_time` double COMMENT '', 
  `elb_response_code` string COMMENT '', 
  `backend_response_code` string COMMENT '', 
  `received_bytes` bigint COMMENT '', 
  `sent_bytes` bigint COMMENT '', 
  `request_verb` string COMMENT '', 
  `url` string COMMENT '', 
  `protocol` string COMMENT '', 
  `user_agent` string COMMENT '', 
  `ssl_cipher` string COMMENT '', 
  `ssl_protocol` string COMMENT '')
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucketname/AWSLogs/XXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
  'transient_lastDdlTime'='1555268331')
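
As a sanity check, listing the table's partitions confirms none are registered, which I assume is why queries return 0 rows (a newly created partitioned table has no partitions until they are added):

SHOW PARTITIONS elb_logs_part;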

I have tried editing the statement to add the region to the end of the LOCATION (this seems logical, but the above was generated from the CREATE query Athena produced, so I went with it).
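
Concretely, the edited LOCATION looked like this (the region is taken from my partition paths below):

LOCATION
  's3://bucketname/AWSLogs/XXXXXXXX/elasticloadbalancing/eu-west-1'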

I have tried to alter the existing table as well using:

ALTER TABLE elb_logs.elb_logs ADD 
   PARTITION(year='2019' month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019' month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019' month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'

Unfortunately this produces the error:

line 2:4: missing 'column' at 'partition'

which I don't understand, as the above is taken directly from the documentation. I imagine it is caused by the partitions not being defined or something...!?

Sorry, this is all new to me. Can anyone help me partition classic load balancer logs stored in S3 using Athena?

I really need to spot the buggers who keep scraping my sites and, I think, unintentionally taking us offline!

Upvotes: 0

Views: 414

Answers (2)

n34_panda

Reputation: 2677

To partition the data I had to take two steps: first create a new partitioned table (its LOCATION is not ideal, as it excludes the region), then add the correct partitions.

Step 1: Create the partitioned table (note the LOCATION is not ideal, as it excludes the region)

CREATE EXTERNAL TABLE `elb_logs_part`(
  `timestamp` string COMMENT '', 
  `elb_name` string COMMENT '', 
  `request_ip` string COMMENT '', 
  `request_port` int COMMENT '', 
  `backend_ip` string COMMENT '', 
  `backend_port` int COMMENT '', 
  `request_processing_time` double COMMENT '', 
  `backend_processing_time` double COMMENT '', 
  `client_response_time` double COMMENT '', 
  `elb_response_code` string COMMENT '', 
  `backend_response_code` string COMMENT '', 
  `received_bytes` bigint COMMENT '', 
  `sent_bytes` bigint COMMENT '', 
  `request_verb` string COMMENT '', 
  `url` string COMMENT '', 
  `protocol` string COMMENT '', 
  `user_agent` string COMMENT '', 
  `ssl_cipher` string COMMENT '', 
  `ssl_protocol` string COMMENT '')
  PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing'
TBLPROPERTIES (
  'transient_lastDdlTime'='1555268331')

Step 2: Add the partitions as below (thanks to Ilya for spotting the typo)

ALTER TABLE elb_logs.elb_logs_part ADD 
   PARTITION(year='2019', month = '11', day = '18') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/18'
   PARTITION(year='2019', month = '11', day = '17') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/17'
   PARTITION(year='2019', month = '11', day = '16') location 's3://bucketname/AWSLogs/XXXXXXXXX/elasticloadbalancing/region/2019/11/16'

This worked for me.
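
Once the partitions are registered, a query like the following (a sketch using the columns from the DDL above) prunes the scan to a single day and surfaces the noisiest client IPs, which is what I was after:

SELECT request_ip, COUNT(*) AS requests
FROM elb_logs.elb_logs_part
WHERE year = '2019' AND month = '11' AND day = '18'
GROUP BY request_ip
ORDER BY requests DESC
LIMIT 20;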

Upvotes: 0

Ilya Kisil

Reputation: 2668

You are missing a comma (,) after you specify year. (Also make sure each partition's location points at the matching day's prefix.) The following statement resulted in Query successful.

ALTER TABLE elb_logs.elb_logs ADD 
   PARTITION(year='2019', month = '11', day = '18') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/18'
   PARTITION(year='2019', month = '11', day = '17') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/17'
   PARTITION(year='2019', month = '11', day = '16') location 's3://buckets/bucketname/AWSLogs/XXXXXXXXXX/elasticloadbalancing/eu-west-1/2019/11/16'

Upvotes: 1
