AWS Glue - How to configure Crawler to create a table with S3 filenames inside

What I need to accomplish: Full text Search on S3 KEYS (not files content)

So, my S3 bucket has this structure:

bucket_name/
  [email protected]/
    folder1/
      notthatimportantfile
      ** any folders or files here

  [email protected]/
    folderN/** any folders or files here
    folderN+1/notimportantfile
    folderX/mysuperimportantfile

And in my JS app, I want to put a search box: if I type the word "important", I get back all the files with that word in their names.

I'm not interested in the contents of those files. I've searched the Internet for a way to do this and haven't found anything.

The last thing I tried was AWS Glue and Athena, but when I create a Crawler and run it in Glue, it creates one table per file. What I want is one table per first-level folder, with the files inside it as rows. For example, the structure above would create 2 tables in the database:

- [email protected]
- [email protected]

And on those tables, their rows would be the files inside those folders in S3:

  TABLE "[email protected]"
     _id    filename                      size
     01234  folder1/notthatimportantfile  1024
     56789  folder1/anotherfile_here.txt  768

Having that, I would easily use AWS Athena to query the data:

SELECT * FROM "[email protected]" WHERE filename LIKE '%important%'

So, is there a way to accomplish this? Am I crazy? Is there an easier way? Maybe AWS CloudSearch?

NOTE for non-programmers: Please don't say "Just go to the AWS console and put your keyword on the searchbox". I'm sick of those answers

Upvotes: 0

Views: 2488

Answers (1)

Theo

Reputation: 132972

There is a way to do it with Athena, but I can't recommend it. I've included it below.

Depending on your application there are different ways to do it. If the files don't change often and you don't mind results that are a few hours old, I would suggest using S3 Inventory, which periodically writes a listing of all the objects in a bucket. To search for a file, you look up the latest inventory manifest, find the inventory listings, and scan through them (you can even use Athena to query the inventory; there is a guide for that in the documentation).
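Once the inventory files are delivered, the search itself is just a scan over the CSV listings. A minimal sketch of that scan (the bucket and key names below are made up, and the column order assumes the default inventory schema of bucket, key, size, ...):

```python
import csv
import io

def search_inventory_listing(csv_text, term):
    """Scan an S3 Inventory CSV listing and return the keys whose
    filename (last path segment) contains `term`."""
    matches = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Default inventory schema: bucket, key, size, last_modified, ...
        key = row[1]
        filename = key.rsplit("/", 1)[-1]
        if term in filename:
            matches.append(key)
    return matches

# Hypothetical sample rows in the default inventory CSV format
sample = (
    '"bucket_name","[email protected]/folderX/mysuperimportantfile","1024"\n'
    '"bucket_name","[email protected]/folder1/notthatimportantfile","768"\n'
    '"bucket_name","[email protected]/folder1/anotherfile_here.txt","512"\n'
)
print(search_inventory_listing(sample, "important"))
```

In a real setup you would download the listing files referenced by the latest manifest (they are gzipped CSV by default) and feed each one through this scan, or point Athena at the inventory location instead.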

If you need to be more up to date, S3 Inventory won't work, since it produces a new listing at most once per day.

If that is the case, you can simply list all of the files in your bucket for every search. Depending on the number of files we're talking about, this can be very fast or very slow: fewer than 1000 files will be very fast, but beyond that you will have to make multiple requests, paging through the objects 1000 at a time, or figure out a way to run multiple paging operations in parallel, listing all objects starting with a in one operation, all starting with b in another, and so on.
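A sketch of the list-and-filter approach (the `search_bucket` function assumes boto3 and valid AWS credentials, so it's not runnable as-is; the filtering itself is plain Python):

```python
def matching_keys(keys, term):
    """Return the keys whose filename (last path segment) contains `term`."""
    return [k for k in keys if term in k.rsplit("/", 1)[-1]]

def search_bucket(bucket, term):
    """List every object in `bucket` (1000 keys per page) and filter
    client-side. Sketch only: needs boto3 and AWS credentials."""
    import boto3  # imported here so the pure helper stays testable offline
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return matching_keys(keys, term)

print(matching_keys(
    ["folder1/notthatimportantfile",
     "folderX/mysuperimportantfile",
     "folderN/other.txt"],
    "important",
))
```

Each `list_objects_v2` page costs one request, so a bucket with a million objects means about a thousand requests per search, which is why this only makes sense for small buckets.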

The most cost-efficient solution, if you want to be up to date and have quick responses, is to keep a separate index of the objects in your bucket. Update the index by listening for S3 Event Notifications: listen for both creations and deletions and update the index accordingly. You can use Elasticsearch, Algolia, Postgres, or any other service or database with decent full-text search capabilities for the index – but it is of course a lot more infrastructure to set up than just S3.
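A rough sketch of the notification-driven index, with a plain dict standing in for the real search backend. The event shape follows the S3 notification format (a "Records" list with eventName and s3.object.key); `handle_event` would be the body of a Lambda function or queue consumer:

```python
from urllib.parse import unquote_plus

# In production the index would live in Elasticsearch, Algolia, or
# Postgres; a dict stands in for it here.
index = {}

def handle_event(event):
    """Apply S3 create/delete notifications to the index."""
    for record in event.get("Records", []):
        # Keys in real S3 notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectCreated"):
            index[key] = True
        elif record["eventName"].startswith("ObjectRemoved"):
            index.pop(key, None)

def search(term):
    """Return indexed keys whose filename contains `term`."""
    return [k for k in index if term in k.rsplit("/", 1)[-1]]

# Hypothetical notification payload
handle_event({"Records": [
    {"eventName": "ObjectCreated:Put",
     "s3": {"object": {"key": "[email protected]/folderX/mysuperimportantfile"}}},
    {"eventName": "ObjectCreated:Put",
     "s3": {"object": {"key": "[email protected]/folder1/anotherfile_here.txt"}}},
]})
print(search("important"))
```

With a real backend, `search` becomes a query against that service's full-text API instead of a linear scan, which is what makes this approach fast at any bucket size.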


If you absolutely want to do it with Athena, it can be done, but it won't be the most efficient solution, and if you have a lot of files and they contain a lot of data, it might end up costing you a bit, so be warned.

There is a hidden column in all Athena tables called $path that contains the full URI of the S3 object that a row was read from. It can be used for filtering too.

If you create a table like this:

CREATE EXTERNAL TABLE `listing` (`ignored` string)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket_name/'
TBLPROPERTIES ('has_encrypted_data'='false')

It's just a table with a dummy column; the column can be named anything, and it's only there because it's not possible to create a table without columns. Just make sure the LOCATION is correct.

To search for objects this is what you do:

SELECT DISTINCT "$path"
FROM listing
WHERE "$path" LIKE '%foo%'

While this may look neat and solve your problem, it's a terribly inefficient and expensive way to do it. Athena will list all the files in your bucket and then read all of them, and you will pay both for the S3 operations and for the data scanned. Even though Athena is in the end only looking at the S3 keys, it has no optimization that stops it from reading every file in full.

Upvotes: 4
