Hitu Bansal

Reputation: 3137

How do you full text search an Amazon S3 bucket?

I have an S3 bucket containing a large number of text files.

I want to search for specific text within those files. They contain raw data only, and each file has a different name.

For example, my bucket contains object keys like:

abc/myfolder/abac.txt

xyx/myfolder1/axc.txt

and I want to search for text like "I am human" in those files.

How can I achieve this? Is it even possible?

Upvotes: 36

Views: 26806

Answers (8)

MBTea

Reputation: 1

You may consider using Amazon Kendra; it helps you index files across multiple data sources and provides both semantic search and exact text search. Keep in mind that, as a managed service, its cost can be relatively high.
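Once an index exists and the S3 bucket has been added as a data source, a query can be issued with boto3. A minimal sketch, assuming an already-synced index (the region and index ID below are placeholders):

    import boto3

    # Kendra client in the region where the index lives (placeholder region)
    kendra = boto3.client("kendra", region_name="us-east-1")

    # IndexId is the ID of your Kendra index (placeholder value)
    response = kendra.query(
        IndexId="YOUR-KENDRA-INDEX-ID",
        QueryText="I am human",
    )

    for item in response["ResultItems"]:
        print(item["DocumentId"], item.get("DocumentExcerpt", {}).get("Text"))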

Upvotes: 0

Sankalp Garg

Reputation: 55

Amazon S3 does not provide any built-in search capability. As mentioned in other answers, Kendra or Elasticsearch can be integrated for this.

The part below does not directly answer the question, but we are building semantic search + keyword search for documents in S3. Our project is open source: https://github.com/AvalokAI/AvalokAI We will also have a hosted version and will be adding support for image search, and potentially RAG-based QA. I would love to hear feedback on this, as well as your requirements and the problems you have faced with search.

PS: I am the founder of Avalok AI.

Upvotes: 0

chendu

Reputation: 829

There is a serverless and cheaper option available:

  1. Use AWS Glue to convert the text files into a table.
  2. Use AWS Athena to run SQL queries on top of it.

I would recommend storing the data as Parquet on S3; this keeps the data size small and makes queries very fast.
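A rough sketch of the Athena step with boto3, assuming a Glue database named mytextdb with a table documents that has a line column, plus an S3 location for query results (all of these names are placeholders):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Run a simple full-text filter over the table built by Glue
    execution = athena.start_query_execution(
        QueryString="SELECT * FROM documents WHERE line LIKE '%I am human%'",
        QueryExecutionContext={"Database": "mytextdb"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

    print("Query execution id:", execution["QueryExecutionId"])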

Upvotes: 0

Ethan

Reputation: 41

Although not an AWS native service, there is Mixpeek, which runs text extraction tools such as Tika, Tesseract, and ImageAI on your S3 files and then places the output in a Lucene index to make it searchable.

You integrate it as follows:

  1. Download the module: https://github.com/mixpeek/mixpeek-python

  2. Import the module and your API keys:

     from mixpeek import Mixpeek, S3
     from config import mixpeek_api_key, aws
    
  3. Instantiate the S3 class (which uses boto3 and requests):

     s3 = S3(
         aws_access_key_id=aws['aws_access_key_id'],
         aws_secret_access_key=aws['aws_secret_access_key'],
         region_name='us-east-2',
         mixpeek_api_key=mixpeek_api_key
     )
    
  4. Upload one or more existing S3 files:

         # upload all S3 files in bucket "demo"            
         s3.upload_all(bucket_name="demo")
    
         # upload one single file called "prescription.pdf" in bucket "demo"
         s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
    
  5. Now simply search using the Mixpeek module:

         # mixpeek api direct
         mix = Mixpeek(
             api_key=mixpeek_api_key
         )
         # search
         result = mix.search(query="Heartgard")
         print(result)
    
  6. Where result can be:

     [
         {
             "_id": "REDACTED",
             "api_key": "REDACTED",
             "highlights": [
                 {
                     "path": "document_str",
                     "score": 0.8759502172470093,
                     "texts": [
                         {
                             "type": "text",
                             "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞  "
                         },
                         {
                             "type": "hit",
                             "value": "Heartgard"
                         },
                         {
                             "type": "text",
                             "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
                         }
                     ]
                 }
             ],
             "metadata": {
                 "date_inserted": "2021-10-07 03:19:23.632000",
                 "filename": "prescription.pdf"
             },
             "score": 0.13313256204128265
         }
     ] 
    

Then you parse the results.

Upvotes: 4

Sachin Sukumaran

Reputation: 715

If you have an EMR cluster, you can create a Spark application and run the search there. We did this, and it works as a distributed search.
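A minimal PySpark sketch of that approach (the bucket, prefix, and search phrase are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-text-search").getOrCreate()

    # Read every text file under the prefix; each line becomes a row in column "value"
    lines = spark.read.text("s3://my-bucket/myfolder/*.txt")

    # Keep only the lines containing the search phrase
    matches = lines.filter(lines.value.contains("I am human"))
    matches.show(truncate=False)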

Upvotes: 0

Mickael Kerjean

Reputation: 129

You can use Filestash (disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. Give it a bit of time to index the entire thing if you have a lot of data, and you should be good.

Upvotes: 0

Frederic Henri

Reputation: 53713

Since October 1st, 2015, Amazon has offered another search service, Amazon Elasticsearch Service. In much the same vein as CloudSearch, you can stream data into it from Amazon S3 buckets.

It works with a Lambda function: any new data sent to the S3 bucket triggers an event notification to the Lambda, which updates the ES index.

All the steps are well detailed in the Amazon documentation, with Java and JavaScript examples.

At a high level, setting up to stream data to Amazon ES requires the following steps:

  • Creating an Amazon S3 bucket and an Amazon ES domain.
  • Creating a Lambda deployment package.
  • Configuring a Lambda function.
  • Granting authorization to stream data to Amazon ES.

Upvotes: 19

user1832464

Reputation:

The only way to do this will be via CloudSearch, which can use S3 as a source. It builds an index to enable rapid retrieval. This should work very well, but thoroughly check out the pricing model to make sure it won't be too costly for you.
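Once a CloudSearch domain has been populated from the bucket, a search can be issued with boto3 against the domain's search endpoint. A rough sketch (the endpoint URL is a placeholder):

    import boto3

    # Search endpoint of the CloudSearch domain (placeholder value)
    client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://search-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
    )

    # Simple full-text query across the indexed documents
    response = client.search(query="I am human")

    for hit in response["hits"]["hit"]:
        print(hit["id"], hit.get("fields"))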

The alternative, as Jack said, is that you'd need to transfer the files out of S3 to an EC2 instance and build a search application there.

Upvotes: 22
