Reputation: 2075
My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is, given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more common place than others?
Also, out of curiosity, why hasn't S3 supported such a basic use case in a service (S3) that is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?
Upvotes: 1
Views: 342
Reputation: 269666
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.
Upvotes: 1
Reputation: 17515
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression but any of PostgreSQL, MySQL, Aurora, and ElasticSearch can. So start with listing every file and put the file name and S3 location into a database-like structure. Then, create a Lambda that is notified of any changes (see this link for more info) that will do the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs ElasticSearch is super flexible with queries and possibly better suited for these types of queries. But traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server. That could be a simple as API gateway to a Lambda or something far more complex.
Upvotes: 1