Subbu

Reputation: 663

Identifying and deleting S3 Objects that are not being accessed?

I recently joined a company that uses S3 buckets for various projects within AWS. I want to identify and potentially delete S3 objects that are not being accessed (read or written), in an effort to reduce the cost of S3 in my AWS account.

I read this, which helped me to some extent.

Is there a way to find out which objects are being accessed and which are not?

Upvotes: 9

Views: 10754

Answers (3)

Aniket Kulkarni

Reputation: 12983

There is a recent AWS blog post that I found very interesting; it describes a cost-optimized approach to solving this problem.

Here is the description from the AWS blog (a code sketch of step 4 follows the list):

  1. The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.

  2. An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.

  3. An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.

  4. The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that must be expired, using the following logic:

  • Capture the number of days (x) from the S3 Lifecycle configuration.
  • Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
  • Write a manifest file with the list of these objects to an S3 bucket.
  • Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
  5. The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 Batch Operations job of "delete=True".
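
To make step 4 more concrete, here is a minimal Python (boto3) sketch of what that Lambda might do: run a simplified Athena query to build the delta list, then start an S3 Batch Operations job that tags everything in the manifest with delete=True. The table names (s3_inventory, s3_access_logs), database name, bucket names, role ARN, and the query itself are assumptions for illustration only; the actual blog post uses its own schemas and query.

```python
# Sketch only: assumes Athena tables `s3_inventory` and `s3_access_logs` exist,
# and that the delta list has already been written as a CSV manifest to MANIFEST_BUCKET.
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")
s3control = boto3.client("s3control")

ACCOUNT_ID = "111122223333"                                        # assumption
MANIFEST_BUCKET = "my-manifest-bucket"                             # assumption
MANIFEST_KEY = "manifests/stale-objects.csv"                       # assumption
BATCH_ROLE_ARN = "arn:aws:iam::111122223333:role/s3-batch-tagging" # assumption

def run_delta_query(days: int) -> str:
    """Start an Athena query listing objects older than `days` with no access in that window."""
    query = f"""
        SELECT inv.bucket, inv.key
        FROM s3_inventory inv
        LEFT JOIN s3_access_logs log
               ON inv.key = log.key
              AND log.request_time > date_add('day', -{days}, current_date)
        WHERE inv.last_modified_date < date_add('day', -{days}, current_date)
          AND log.key IS NULL
    """
    resp = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "s3_audit"},  # assumption
        ResultConfiguration={"OutputLocation": f"s3://{MANIFEST_BUCKET}/athena-results/"},
    )
    return resp["QueryExecutionId"]

def create_tagging_job() -> str:
    """Create an S3 Batch Operations job that tags every object in the manifest with delete=True."""
    etag = s3.head_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY)["ETag"].strip('"')
    resp = s3control.create_job(
        AccountId=ACCOUNT_ID,
        ConfirmationRequired=False,
        Priority=10,
        RoleArn=BATCH_ROLE_ARN,
        Operation={"S3PutObjectTagging": {"TagSet": [{"Key": "delete", "Value": "True"}]}},
        Manifest={
            "Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]},
            "Location": {
                "ObjectArn": f"arn:aws:s3:::{MANIFEST_BUCKET}/{MANIFEST_KEY}",
                "ETag": etag,
            },
        },
        Report={
            "Enabled": True,
            "Bucket": f"arn:aws:s3:::{MANIFEST_BUCKET}",
            "Format": "Report_CSV_20180820",
            "Prefix": "batch-reports",
            "ReportScope": "FailedTasksOnly",
        },
    )
    return resp["JobId"]
```

The Lifecycle rule on the source bucket would then use a tag filter of delete=True, so that only the objects tagged by the Batch Operations job are expired.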

Upvotes: 1

Matt D

Reputation: 3496

There is no native way of doing this at the moment, so all of the options are workarounds that depend on your use case.

You have a few options:

  1. Tag each S3 object with its last access date (e.g. 2018-10-24). First, turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired on each Get request. Finally, create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today (a sketch of the tagging function follows this list).
  2. Query CloudTrail logs: write a custom function that pulls the last access times from the object-level CloudTrail logs. This could be done with Athena, or with a direct query against the logs in S3.
  3. Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
  4. Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
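
To illustrate option 1, here is a minimal sketch of the tagging Lambda. It assumes object-level CloudTrail data events for the bucket are forwarded to the function via a CloudWatch Events / EventBridge rule, and that the event detail carries the usual requestParameters.bucketName and key fields; the tag name "last-accessed" is an arbitrary choice for illustration.

```python
# Sketch, not a drop-in solution: assumes a CloudWatch Events / EventBridge rule
# forwards S3 object-level CloudTrail events (GetObject) to this Lambda.
import datetime
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # CloudTrail events delivered through EventBridge carry the record in `detail`.
    detail = event.get("detail", {})
    if detail.get("eventName") != "GetObject":
        return  # only interested in read access

    params = detail.get("requestParameters", {})
    bucket, key = params.get("bucketName"), params.get("key")
    if not bucket or not key:
        return

    # Stamp today's date on the object; the scheduled cleanup function can later
    # delete (or archive) everything whose last-accessed tag is older than the cutoff.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [
            {"Key": "last-accessed", "Value": datetime.date.today().isoformat()},
        ]},
    )
```

Note that put_object_tagging replaces the object's entire tag set (so merge any existing tags first if you rely on them), and tagging on every GET adds request costs of its own, so this only pays off when the storage savings outweigh that overhead.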

Upvotes: 5

John Rotenstein
John Rotenstein

Reputation: 269520

No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.

For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.

If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.

Upvotes: 2
