Reputation: 131
I am looking for a good way to store up to 20 terabytes of data (social media postings, Twitter data, etc.) in the cloud and gradually feed it into Elasticsearch (to enable faceted searching) so that it can be quickly searched. I was going to break this into two steps: saving the data to storage, then indexing it (the next day or the next month). I have seen mention of Redis. Would it be appropriate here? Would it be better to use AWS and S3, or Google, for this? Is there a better way to do this than using Redis? Once the data is indexed, I don't need the original data anymore.
Upvotes: 1
Views: 900
Reputation: 165
AWS is a natural fit: data transfer into S3 is free, so uploading your 20 TB costs nothing in bandwidth (you pay only for storage and requests). AWS offers hosted Elasticsearch and hosted Redis (ElastiCache), or you can host your own on EC2. Redis is an in-memory key-value store and is not well suited for faceted search, whereas Elasticsearch is a persisted document store built precisely for search and aggregation.
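For the first step (getting the raw data into S3), a minimal upload sketch with boto3 might look like the following; the bucket name, key prefix, and file name are all hypothetical:

```python
# Minimal sketch: upload one batch of raw postings to S3 as
# newline-delimited JSON. Assumes boto3 is installed and AWS
# credentials are configured; names below are hypothetical.
import boto3

s3 = boto3.client("s3")

# upload_file handles multipart uploads automatically for large files.
s3.upload_file(
    Filename="postings-2016-01-01.ndjson",   # local batch file (hypothetical)
    Bucket="my-social-media-archive",        # hypothetical bucket
    Key="raw/2016/01/01/postings.ndjson",    # date-based prefix keeps batches easy to find
)
```

A date-based key prefix like this makes it simple to index (and later clean up) one day's worth of data at a time.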
If you enable S3 event notifications, a file-creation event can trigger an AWS Lambda function, written in Python or another language, that automatically reads your data whenever a new file appears and inserts it using the Elasticsearch HTTP API. The first 1 million Lambda executions per month are free. The Elasticsearch index mapping lets you choose which fields will automatically be indexed for search.
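A minimal sketch of such a handler, assuming the elasticsearch Python client is packaged with the function and the domain's access policy accepts requests from the Lambda's role without extra signing (otherwise requests must be SigV4-signed); the endpoint and index names are hypothetical:

```python
# Sketch of an S3-triggered Lambda handler that bulk-indexes
# newline-delimited JSON into Elasticsearch. Endpoint and index
# names are hypothetical.
import json
import urllib.parse

import boto3
from elasticsearch import Elasticsearch, helpers

ES_HOST = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
es = Elasticsearch(hosts=[ES_HOST])
s3 = boto3.client("s3")

def handler(event, context):
    # S3 invokes Lambda with one or more records per event.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Stream the newly created file and index one document per line.
        # (Older Elasticsearch versions also require a "_type" field.)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        actions = (
            {"_index": "postings", "_source": json.loads(line)}
            for line in body.iter_lines()
            if line
        )
        helpers.bulk(es, actions)
```

Bulk indexing rather than one HTTP request per document matters at this volume; it is the difference between the backfill taking days and taking weeks.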
After you are finished with the data, delete it or change its storage class to Infrequent Access or Reduced Redundancy to save on your bill. I use http://www.insight4storage.com/ to lower my S3 costs by tracking my storage usage trends.
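Rather than changing storage classes or deleting objects by hand, an S3 lifecycle rule can do the tiering and eventual deletion automatically. A sketch against the same hypothetical bucket, with assumed 30/90-day thresholds you would tune to your indexing lag:

```python
# Sketch: lifecycle rule that moves raw data to Infrequent Access
# after 30 days and deletes it after 90, once indexing is done.
# Bucket name and day counts are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-social-media-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-raw-data",
                "Filter": {"Prefix": "raw/"},   # only the raw dumps, not other keys
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": 90},
            }
        ]
    },
)
```

Since you said you don't need the originals once they are indexed, the expiration rule keeps the bucket from silently accumulating 20 TB of storage charges.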
Upvotes: 1