Ketan Khairnar

Reputation: 1630

Need a solution for archiving logs with real-time search functionality

I've been considering the following options.

  1. senseidb [http://www.senseidb.com] This needs a fixed schema as well as data gateways, so there is no simple way to push data; you have to provide data streams instead. My data is unstructured, and there are very few common attributes across all kinds of logs.

  2. riak [http://wiki.basho.com/Riak-Search.html]

  3. Vertica - cost is a concern.

  4. HBase (+ Hadoop ecosystem + Lucene) - the main cons here are that this won't make much sense on a single machine, and I'm not sure about the free-text search capability that would have to be built around it.

Main requirements:

  1. It has to sustain thousands of incoming requests for archival while simultaneously building a real-time index that lets end users do free-text search.

  2. Storage (log archives + index) has to be optimal.

Upvotes: 0

Views: 1617

Answers (3)

MrKurt

Reputation: 5100

There are a number of specialized log storage and indexing tools; I don't know that I'd cram logs into a normal data store necessarily.

If you have lots of money, it's tough to beat Splunk.

If you'd prefer an open source option, check out the ServerFault discussion. logstash + ElasticSearch seems to be a really strong choice and should scale well as your logs grow.
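To make that concrete, here is a minimal sketch of archiving and searching log lines through Elasticsearch's HTTP API directly (logstash would normally do the shipping for you). The host URL, the daily index naming, and the `_doc` endpoint are assumptions on my part; older Elasticsearch releases used typed endpoints instead.

```python
import datetime
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node

def archive_log(line, host="app01"):
    """Index one raw log line into a daily index, so old days can be
    closed or moved to cheap storage wholesale."""
    index = "logs-" + datetime.date.today().strftime("%Y.%m.%d")
    doc = {
        "@timestamp": datetime.datetime.utcnow().isoformat(),
        "host": host,
        "message": line,   # unstructured text, analyzed for free-text search
    }
    # refresh=true only for the demo; normally you accept the ~1 s
    # default refresh interval, which is what makes the search "real-time".
    r = requests.post(f"{ES}/{index}/_doc",
                      params={"refresh": "true"}, json=doc)
    r.raise_for_status()

def search_logs(term):
    """Free-text search across every daily index at once."""
    r = requests.get(f"{ES}/logs-*/_search", params={"q": f"message:{term}"})
    r.raise_for_status()
    return r.json()["hits"]["hits"]

archive_log("ERROR payment gateway timeout after 30s")
print(search_logs("timeout"))
```

The per-day index layout also covers the storage requirement: expiring a day of logs is a single index delete rather than millions of document deletes.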

Upvotes: 1

David Gruzman

Reputation: 8088

2-3 TB of data sounds like an "in the middle" case. If that is all the data, I would not suggest going on a Big Data / NoSQL venture.
I think an RDBMS with full-text search capability should do on good hardware. I would suggest aggressive partitioning by time so that you can work with 2-3 TB of data; without partitioning it would be too much. At the same time, if your data is partitioned by day, I think each partition will be a size MySQL can handle.
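A minimal sketch of that per-day layout in MySQL, assuming a local server and the `mysql.connector` driver; table and column names are illustrative. One caveat: native MySQL partitions cannot carry FULLTEXT indexes, so "partition by day" is done here with one table per day.

```python
import datetime
import mysql.connector   # assumed driver; any MySQL DB-API connector works

# Native MySQL partitions do not support FULLTEXT indexes, so "partition
# by day" is implemented as one table per day instead.
DDL = """
CREATE TABLE IF NOT EXISTS logs_{day} (
    id      BIGINT AUTO_INCREMENT PRIMARY KEY,
    ts      DATETIME NOT NULL,
    host    VARCHAR(128),
    message TEXT,
    FULLTEXT KEY ft_message (message)
) ENGINE=InnoDB
"""   # InnoDB full-text needs MySQL 5.6+; older versions need MyISAM

def table_for(day):
    return "logs_" + day.strftime("%Y%m%d")

conn = mysql.connector.connect(user="logs", password="...",
                               host="localhost", database="archive")
cur = conn.cursor()
day = datetime.date.today()
cur.execute(DDL.format(day=day.strftime("%Y%m%d")))

# Archival: append into today's table.
cur.execute(f"INSERT INTO {table_for(day)} (ts, host, message) "
            "VALUES (NOW(), %s, %s)",
            ("app01", "ERROR disk full on /var"))
conn.commit()

# Free-text search within a day; expiring a day is a cheap DROP TABLE.
cur.execute(f"SELECT ts, host, message FROM {table_for(day)} "
            "WHERE MATCH(message) AGAINST (%s)", ("disk",))
print(cur.fetchall())
```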
Taking into account the comment below that the data size is actually about 10-15 TB, and that the need for replication will multiply that number by 2-3x, we should also consider the size of the indexes, which I would estimate at dozens of percent of the data size. An efficient single-node solution might well end up more expensive than a cluster, mostly because of licensing costs.
To the best of my understanding, existing Hadoop/NoSQL solutions cannot answer your requirements out of the box, mostly because of the sheer number of documents to be indexed. In our case, each log entry is a document. (http://blog.mgm-tp.com/2010/06/hadoop-log-management-part3/)
So I think the solution will be to aggregate logs for some period of time and treat each aggregate as one document, as in the sketch below.
For the storage of these log packages, HDFS or Swift could be a good solution.
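A hypothetical sketch of that aggregation step: roll individual lines into one document per source and hour, so the indexer sees thousands of documents per day rather than millions. The hour-long period and the field names are assumptions.

```python
import datetime
from collections import defaultdict

# Roll individual log lines into one document per (source, hour), so the
# document count stays manageable for the indexer.
buckets = defaultdict(list)

def add_line(source, line, ts=None):
    ts = ts or datetime.datetime.utcnow()
    hour = ts.strftime("%Y-%m-%dT%H")      # aggregation period: one hour
    buckets[(source, hour)].append(line)

def flush():
    """Emit one document per bucket; store the blob in HDFS/Swift and
    hand only the concatenated text to the indexer."""
    docs = []
    for (source, hour), lines in buckets.items():
        docs.append({
            "id": f"{source}/{hour}",
            "source": source,
            "hour": hour,
            "body": "\n".join(lines),      # one searchable document
        })
    buckets.clear()
    return docs

add_line("app01", "WARN cache miss rate above 90%")
add_line("app01", "ERROR replication lag 120s")
print(flush())
```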

Upvotes: 0

javanx

Reputation: 698

Have you given any thought to something along the lines of these implementations? It might be helpful to integrate Lucene and Hadoop for your problem.

http://www.cloudera.com/blog/2011/09/hadoop-for-archiving-email/
http://www.cloudera.com/blog/2012/01/hadoop-for-archiving-email-part-2/

So instead of email, your use case could index the log files and the parameters extracted from them.
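The log-side analogue of parsing an email into sender/date/subject fields would be extracting parameters from each line before indexing. A hypothetical sketch, assuming a simple "timestamp level message" line format:

```python
import re

# Hypothetical line format: "2012-04-01 12:30:05 ERROR payment timed out"
LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)")

def to_index_doc(raw_line, source_file):
    """Turn one raw log line into the field set handed to the indexer,
    the way the email-archiving posts turn a message into fields."""
    m = LINE.match(raw_line)
    if not m:                       # unstructured leftovers: index as-is
        return {"file": source_file, "message": raw_line}
    doc = m.groupdict()
    doc["file"] = source_file       # lets a search hit link back to the archive
    return doc

print(to_index_doc("2012-04-01 12:30:05 ERROR payment timed out",
                   "app01/2012-04-01.log"))
```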

Upvotes: 0
