Trinh Thi Hai Yen

Reputation: 77

How to query AWS load balancer log if there are terabytes of logs?

I want to query AWS load balancer logs so that a report is sent to me automatically, on a schedule.

I am using Amazon Athena, with AWS Lambda triggering the Athena queries. I created the table by following the guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html

However, I have run into the following issues:

Can you suggest a better solution to my problem? I am thinking of using the ELK stack, but I have no experience with it. What are the advantages and disadvantages of ELK compared to the combination of AWS Lambda and Amazon Athena? Thank you!

Upvotes: 1

Views: 781

Answers (1)

Parsifal

Reputation: 4516

First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.

Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
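As a rough illustration of the start-then-poll pattern, here is a minimal Python sketch. It assumes a boto3 Athena client is passed in; the function names, poll interval, and attempt limit are my own choices for illustration, not anything Athena prescribes.

```python
import time

# States after which Athena will not change the query status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def start_query(athena, sql, database, output_location):
    """Submit a query and return its ID without waiting for the result."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def wait_for_query(athena, query_id, poll_seconds=5.0, max_polls=60,
                   sleep=time.sleep):
    """Poll GetQueryExecution until the query reaches a terminal state.

    poll_seconds and max_polls are arbitrary defaults (assumptions).
    """
    for _ in range(max_polls):
        response = athena.get_query_execution(QueryExecutionId=query_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")


# Usage (needs boto3 and AWS credentials; names below are placeholders):
#   import boto3
#   athena = boto3.client("athena")
#   qid = start_query(athena, "SELECT 1", "my_database", "s3://my-results/")
#   print(wait_for_query(athena, qid))
```

In a Lambda setup, one function could call start_query and exit, and a second scheduled invocation could check the saved query ID, so that neither sits idle while the query runs.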

With that out of the way, the problem with reading access logs from Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme expects S3 paths with segments of the form /column=value/. ALB access logs, however, are stored under a simpler yyyy/mm/dd layout.

If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
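For the yyyy/mm/dd layout, one possible workaround is to register each day's prefix by hand with ALTER TABLE ... ADD PARTITION and an explicit LOCATION. The sketch below only builds the statement text. It assumes the table was declared with a string partition column, here hypothetically named day, and that logs sit under the usual AWSLogs/<account-id>/elasticloadbalancing/<region>/ prefix; the bucket, account, and table names are placeholders.

```python
from datetime import date


def alb_log_prefix(bucket, account_id, region, day, prefix=""):
    """Return the S3 location holding one day's ALB access logs."""
    base = f"s3://{bucket}/"
    if prefix:
        base += prefix.strip("/") + "/"
    return (
        f"{base}AWSLogs/{account_id}/elasticloadbalancing/"
        f"{region}/{day:%Y/%m/%d}/"
    )


def add_partition_sql(table, bucket, account_id, region, day, prefix=""):
    """Build an ALTER TABLE statement that maps one day to its S3 prefix.

    'day' as the partition column name is an assumption for illustration.
    """
    location = alb_log_prefix(bucket, account_id, region, day, prefix)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (day = '{day:%Y/%m/%d}') LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(add_partition_sql("alb_logs", "my-log-bucket", "123456789012",
                            "us-east-1", date(2021, 3, 5)))
```

Queries that then filter on the partition column should only scan the matching days rather than the whole bucket.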

Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:

  • Bucket_A is the destination for access logs, and the source for your Athena queries. It has a lifecycle policy that deletes logs after 30 (or 45, or whatever) days.
  • Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the storage cost roughly in half.
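The expiration rule on Bucket_A could look something like this in Python. This is only a sketch: the rule ID is made up, the 30-day figure is just the example from the list above, and the helper merely builds the dictionary that boto3's put_bucket_lifecycle_configuration expects.

```python
def expiring_lifecycle_config(days, prefix=""):
    """Build a lifecycle configuration that deletes objects after `days`.

    An empty prefix applies the rule to every object in the bucket.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Usage (needs boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="Bucket_A",
#       LifecycleConfiguration=expiring_lifecycle_config(30),
#   )
```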

Elasticsearch is certainly a popular option. You'll need to convert the log files in order to upload them. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.
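If you end up doing the conversion yourself rather than through Logstash, the core of it is turning each space-delimited log line into a JSON document. Here is a minimal sketch, assuming the documented ALB field order and only covering the first fourteen fields; the field names are my own labels and the sample line is made up for illustration.

```python
import json
import shlex

# Labels for the leading ALB access log fields, in order (later fields omitted).
ALB_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
]

# Made-up sample line, shaped like an ALB access log entry.
SAMPLE = (
    'http 2018-07-02T22:23:00.186641Z app/my-alb/50dc6c495c0c9188 '
    '192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - -'
)


def parse_alb_line(line):
    """Split one log line, honouring double-quoted fields, into a dict."""
    tokens = shlex.split(line)
    # zip stops at the shorter sequence, so trailing fields are dropped.
    return dict(zip(ALB_FIELDS, tokens))


def to_json_document(line):
    """Serialise one parsed log line as a JSON string, ready for indexing."""
    return json.dumps(parse_alb_line(line))


if __name__ == "__main__":
    print(to_json_document(SAMPLE))
```

All values stay as strings here; a real pipeline would probably cast the numeric fields and parse the timestamp before indexing.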

Upvotes: 1
