Ashan

Reputation: 19728

Query S3 logs content using Athena or DynamoDB

I have a use case to query request URLs from S3 logs. Amazon has recently introduced Athena to query S3 file contents. Which option is better with respect to cost and performance?

  1. Use Athena to query S3 files for request URLs
  2. Store metadata for each file, including the request URL information, in a DynamoDB table and query that

Upvotes: 5

Views: 6545

Answers (3)

Deepak Singhal

Reputation: 10864

Athena vs. DynamoDB: if you can meet your requirement with either, then:

  1. DynamoDB will be many times faster than Athena.
  2. DynamoDB will be more expensive than Athena. In DynamoDB you pay for provisioned read/write capacity whether or not you use it, while in Athena you pay ONLY when you query (otherwise you pay just the S3 storage cost).

Hence, if you query your data only rarely, Athena is the better solution; otherwise choose DynamoDB. Likewise, if performance is paramount, DynamoDB is the answer. And if you already have TBs of data in S3, Athena is the way to go: why load it all into DynamoDB, which would cost a fortune, unless you need query results in milliseconds or seconds?

Upvotes: 4

John Rotenstein

Reputation: 269340

Amazon DynamoDB would be a poor choice for running queries over web logs.

DynamoDB is super-fast, but only if you are retrieving data based upon its Primary Key ("Query"). If you are running a query against ALL data in a table (e.g. to find a particular IP address in an attribute that is not indexed), DynamoDB will need to scan through ALL rows in the table, which takes a lot of time ("Scan"). For example, if your table is configured for 100 reads per second and you are scanning 10,000 rows, it will take 100 seconds (10,000 rows ÷ 100 reads per second = 100 seconds).

Tip: Do not do full-table scans in a NoSQL database.
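
To make the difference concrete, here is a minimal boto3 sketch; the table name, key schema and sample values are hypothetical, assuming a table whose partition key is the request URL:

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    # Hypothetical table whose partition key is "request_url".
    table = boto3.resource("dynamodb").Table("web_logs")

    # Query: goes straight to the matching partition key, so DynamoDB
    # reads only the matching items. Fast and cheap.
    hits = table.query(
        KeyConditionExpression=Key("request_url").eq("/index.html")
    )["Items"]

    # Scan: reads EVERY item in the table and applies the filter
    # afterwards, consuming read capacity for the whole table.
    # At 100 reads/second, 10,000 rows take 100 seconds, as above.
    hits = table.scan(
        FilterExpression=Attr("remote_ip").eq("203.0.113.9")
    )["Items"]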

Amazon Athena is ideal for scanning log files! There is no need to pre-load data - simply run the query against the logs already stored in Amazon S3, using standard SQL to find the data you're seeking. Plus, you only pay for the data that is read from disk. The S3 access log format is a bit unusual, so you'll need the correct CREATE TABLE statement (sketched below).

See: Using AWS Athena to query S3 Server Access Logs
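
For the original use case (pulling request URLs out of the logs), a sketch of what this looks like through boto3. The bucket names, database and output location are placeholders, and the input.regex is a paraphrase of the S3 server access log pattern from the AWS documentation - check the current docs, since fields have been added to the log format over time:

    import time
    import boto3

    athena = boto3.client("athena")

    # One-time DDL: point an external table at the raw access logs.
    create_table = r"""
    CREATE EXTERNAL TABLE IF NOT EXISTS s3_access_logs (
        bucket_owner STRING, bucket STRING, request_datetime STRING,
        remote_ip STRING, requester STRING, request_id STRING,
        operation STRING, key STRING, request_uri STRING,
        http_status STRING, error_code STRING, bytes_sent BIGINT,
        object_size BIGINT, total_time STRING, turnaround_time STRING,
        referrer STRING, user_agent STRING, version_id STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ('input.regex' =
      '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*).*$')
    LOCATION 's3://your-log-bucket/logs/'
    """

    # The actual question: which request URLs are being hit most?
    top_urls = """
    SELECT request_uri, COUNT(*) AS hits
    FROM s3_access_logs
    GROUP BY request_uri
    ORDER BY hits DESC
    LIMIT 20
    """

    def run(sql):
        """Start a query and wait for it (the DDL must finish first)."""
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://your-results-bucket/athena/"},
        )["QueryExecutionId"]
        while athena.get_query_execution(QueryExecutionId=qid)[
                "QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
            time.sleep(1)
        return qid

    run(create_table)
    results = athena.get_query_results(QueryExecutionId=run(top_urls))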

Another choice is to use Amazon Redshift, which can query GBs, TBs and even PBs of data across billions of rows. If you are going to run frequent queries against the log data, Redshift is great. However, being a standard SQL database, it needs the data pre-loaded. Unfortunately, Amazon S3 log files are not in CSV format, so you would need to ETL the files into a suitable format first. This isn't worthwhile for occasional, ad-hoc requests.

Many people also like to use Amazon Elasticsearch Service for scanning log files. Again, the file format needs some special handling and the pipeline to load the data needs some work, but the result is near-realtime interactive analysis of your S3 log files.

See: Using the ELK stack to analyze your S3 logs

Upvotes: 11

As Deepak mentioned, DynamoDB is faster, but its cost is higher than Athena's. Depending on your use case, a hybrid approach might give you good results in certain scenarios.

You can use DynamoDB to store recent, read-heavy data. Older, rarely read data can stay in S3, and you can use Athena to query over it.

However, implementation-wise this will be a bit complex.
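
A rough sketch of that routing logic; the names, the seven-day cutoff and the table layouts are all hypothetical, reusing the DynamoDB table and Athena setup sketched in the answer above:

    import datetime
    import boto3
    from boto3.dynamodb.conditions import Key

    HOT_DAYS = 7  # hypothetical cutoff: newer data lives in DynamoDB

    recent = boto3.resource("dynamodb").Table("recent_web_logs")
    athena = boto3.client("athena")

    def lookups_for_url(url, day):
        """Route a lookup: DynamoDB for hot data, Athena for cold data."""
        if (datetime.date.today() - day).days <= HOT_DAYS:
            # Millisecond reads, but you pay for capacity around the clock.
            return recent.query(
                KeyConditionExpression=Key("request_url").eq(url)
            )["Items"]
        # Seconds-level latency, but you pay only for the data scanned.
        # Returns a query id: poll get_query_execution, then fetch results.
        sql = f"SELECT * FROM s3_access_logs WHERE request_uri LIKE '%{url}%'"
        return athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://your-results-bucket/athena/"},
        )["QueryExecutionId"]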

Upvotes: 0
