Reputation: 1630
I've been considering the following options:
SenseiDB [http://www.senseidb.com] - needs a fixed schema and data gateways, so there is no simple way to push data; you have to provide data streams instead. My data is unstructured, and there are very few common attributes across all kinds of logs.
Riak [http://wiki.basho.com/Riak-Search.html]
Vertica - cost could be a factor?
HBase (+ Hadoop ecosystem + Lucene) - the main cons here are that this won't make much sense on a single machine, and I'm not sure about the free-text search capability that would have to be built around it.
The main requirements are: 1. it has to sustain thousands of incoming requests for archival, and 2. at the same time build a real-time index that lets end users do free-text search.
Upvotes: 0
Views: 1617
Reputation: 5100
There are a number of specialized log storage and indexing tools; I don't know that I'd cram logs into a normal data store.
If you have lots of money, it's tough to beat Splunk.
If you'd prefer an open-source option, check out the ServerFault discussion. logstash + ElasticSearch seems like a really strong choice, and should grow pretty well as your logs do.
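To give a feel for how little glue this takes, here is a minimal sketch of pushing one log event into ElasticSearch over its REST API and searching it back. It assumes a node on localhost:9200 and the third-party requests library; the index, type, and field names are invented for illustration.

    import json
    import requests  # third-party HTTP client (pip install requests)

    ES = "http://localhost:9200"   # assumed local ElasticSearch node
    INDEX = "logs-2012.06.01"      # hypothetical daily index

    # Index one unstructured log event. ElasticSearch needs no fixed
    # schema, so logs with different attributes can share an index.
    event = {"timestamp": "2012-06-01T12:00:00", "host": "web01",
             "message": "disk write error on /dev/sda1"}
    requests.post("%s/%s/event" % (ES, INDEX), data=json.dumps(event))

    # Free-text search over the message field.
    query = {"query": {"match": {"message": "disk error"}}}
    result = requests.get("%s/%s/_search" % (ES, INDEX),
                          data=json.dumps(query)).json()
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["message"])

Logstash would normally sit in front of this, tailing the log files and doing the POSTs for you.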
Upvotes: 1
Reputation: 8088
2-3 TB of data sounds like an "in the middle" case. If that is all the data, I would not suggest venturing into Big Data / NoSQL.
I think an RDBMS with full-text search capability should do on good hardware. I would suggest aggressive partitioning by time to be able to work with 2-3 TB of data; without partitioning it would be too much. At the same time, if your data is partitioned by day, I think the data size will be fine for MySQL.
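To make the partitioning idea concrete, here is a hedged sketch using the MySQLdb driver. Table and column names are invented; note that MySQL does not support FULLTEXT indexes on natively partitioned tables, so this falls back to one MyISAM table per day, which is manual time partitioning.

    import MySQLdb  # classic MySQL-python driver

    conn = MySQLdb.connect(host="localhost", user="logs",
                           passwd="secret", db="logdb")
    cur = conn.cursor()

    def create_day_table(day):
        # One table per day: manual time partitioning. Native
        # PARTITION BY cannot be combined with FULLTEXT indexes,
        # so per-day tables keep both time pruning and free text.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS logs_%s (
                id BIGINT AUTO_INCREMENT PRIMARY KEY,
                logged_at DATETIME NOT NULL,
                host VARCHAR(64),
                message TEXT,
                FULLTEXT KEY ft_message (message)
            ) ENGINE=MyISAM
        """ % day)

    create_day_table("20120601")
    # Free-text search confined to a single day's table.
    cur.execute("SELECT logged_at, message FROM logs_20120601 "
                "WHERE MATCH(message) AGAINST (%s)", ("disk error",))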
Taking into account the comment below that the data size is about 10-15 TB, the need for some replication will multiply this number by 2-3x. We should also consider the size of the indexes, which I would estimate at a few dozen percent of the data size. An efficient single-node solution would probably be more expensive than clustering, mostly because of licensing costs.
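As a back-of-envelope check (the replication factor and index overhead below are the rough figures from this answer, not measurements):

    raw_tb = 15.0         # upper estimate of the data size
    replication = 3       # x2-x3 copies; worst case taken here
    index_overhead = 0.3  # indexes assumed at ~30% of data size

    total_tb = raw_tb * replication * (1 + index_overhead)
    print("cluster capacity needed: ~%.0f TB" % total_tb)  # ~59 TB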
To the best of my understanding, existing Hadoop/NoSQL solutions cannot meet your requirements out of the box, mostly because of the number of documents to be indexed. In our case, each log is a document. (http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/)
So I think the solution is to aggregate logs for some period of time and treat each bundle as one document, as in the sketch below.
For storing these log bundles, HDFS or Swift could be a good solution.
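A minimal sketch of that aggregation step; the one-hour window and file layout are assumptions. The resulting gzip bundles would then be shipped to HDFS or Swift, and each bundle indexed as a single document.

    import gzip
    import time

    BUNDLE_SECONDS = 3600  # assumed window: one hour of logs per document

    def bundle_logs(lines):
        # Group (unix_time, raw_line) pairs into hourly gzip bundles.
        # Treating each bundle as one document keeps the total
        # document count manageable for the indexer.
        current_hour, out = None, None
        for ts, line in lines:
            hour = int(ts) // BUNDLE_SECONDS
            if hour != current_hour:   # window rolled over: new bundle
                if out:
                    out.close()
                out = gzip.open("logs-%d.gz" % hour, "wb")
                current_hour = hour
            out.write(line + b"\n")
        if out:
            out.close()

    # Example: two synthetic log lines landing in the same hour.
    now = time.time()
    bundle_logs([(now, b"GET /index.html 200"),
                 (now + 1, b"GET /favicon.ico 404")])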
Upvotes: 0
Reputation: 698
Have you given any thought to this line of implementation? It might be helpful to integrate Lucene and Hadoop for your problem.
http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/
http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/
So instead of email, your use case would apply the same approach to your log files and the parameters you want to index.
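Sketching the analogy: as with the email headers in those posts, a few extracted parameters become the index fields while the raw line stays in the archive. The log format and field names below are invented for illustration.

    import re

    # Hypothetical syslog-ish line:
    # "2012-06-01 12:00:00 web01 ERROR disk full"
    LINE_RE = re.compile(
        r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
        r"(?P<host>\S+) (?P<level>\S+) (?P<message>.*)")

    def to_index_fields(line):
        # Map one raw log line to the fields a Lucene document
        # would get, mirroring how the Cloudera posts index
        # selected email headers.
        m = LINE_RE.match(line)
        return m.groupdict() if m else {"message": line}

    print(to_index_fields("2012-06-01 12:00:00 web01 ERROR disk full"))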
Upvotes: 0