dooz

Reputation: 161

Ideal system for storing and parsing text logs and reports

I have a lot of text reports and log files from running networking tests. I want to store these reports and logs in a data store where I can parse them and run reports based on the parsed data. I also want this system to be extensible, both in the types of reports and logs it accepts and in the amount of data and the queries/reports it can support.

A coworker suggested Hadoop as possibly fulfilling this need, and another team in my organization says they use Cassandra for a similar project (but with much more data, most of it machine-generated). I've been reading about Hadoop and Cassandra and I'm really not sure whether using something like that would be overkill and whether a relational database with a custom parser for each log/report type would be more sane.

From my understanding of Hadoop, I'd need to write MapReduce functions to parse each type of input data anyway, and I think I'd need to do something similar if I used Cassandra. I also have read a little about Hive, which sounds like it might be of use, but I haven't looked into it very deeply.
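For example, I imagine each parser would end up as a small mapper/reducer pair along these lines (Python, in the Hadoop Streaming style; the log line format, field names, and regex here are stand-ins I made up for illustration, not my real report format):

```python
import re

# Hypothetical log line format (an assumption for illustration):
#   2013-04-02 12:00:01 host42 latency_ms=17
LINE_RE = re.compile(r"^\S+ \S+ (?P<host>\S+) latency_ms=(?P<ms>\d+)$")

def mapper(lines):
    """Emit (host, latency) pairs for lines that match; skip the rest."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield m.group("host"), int(m.group("ms"))

def reducer(pairs):
    """Average latency per host. Assumes pairs arrive grouped by key,
    which is what Hadoop guarantees between the map and reduce phases."""
    totals = {}
    for host, ms in pairs:
        count, total = totals.get(host, (0, 0))
        totals[host] = (count + 1, total + ms)
    return {host: total / count for host, (count, total) in totals.items()}
```

So my sense is that per-format parsing code is unavoidable either way; the question is what sits around it.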

What are the benefits (if any) to using Hadoop or Cassandra (or something else) in my situation?

Any sort of advice would be appreciated.

Upvotes: 2

Views: 269

Answers (1)

larsen

Reputation: 1431

Here's what I get from the description of your problem:

  • You have some testing procedures that generate logs and text reports. Can you give at least a rough idea of the size of this data?
  • You want to analyze this data after it's generated (i.e. there's no need for real-time analysis)
  • You want flexibility on the size of data you can ingest and process, and on the type of queries and analysis you can do

Here are some insights and caveats about the tools you mentioned:

  • Given a Hadoop cluster that's already configured, Hive is probably the simplest solution: it lets you treat your data as if it were a set of tables, with SQL queries, joins, and so on. Hive is (roughly) as quick as your cluster is big, but you won't get instant answers: in other words, you can use it for batch operations, not for interactive web panels and the like.

  • Cassandra is useful for storing large quantities of data. It scales easily, and it's robust and relatively easy to use. What I think might be a concern, given your requirements, is that it requires you to think very thoroughly about the schema you're going to use to store the data: the schema determines what you can and cannot do afterwards. Thus, if you later want to perform broader analysis, or read the data in ways you can't imagine today, it might turn out that you can't, because of how the data is stored in the database.
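To make that Cassandra caveat concrete, here is a toy sketch in plain Python of "query-first" modeling: data is denormalized under the key you plan to query by, and a query you didn't plan for has no efficient path. (The table and field names are invented for illustration; this is not the Cassandra API.)

```python
# Toy model of Cassandra-style storage: one "table" per query pattern,
# with rows denormalized under the partition key you will query by.

# The query we planned for: "all results for a given test run".
results_by_run = {
    "run-001": [
        {"host": "host42", "status": "PASS", "latency_ms": 17},
        {"host": "host43", "status": "FAIL", "latency_ms": 250},
    ],
}

def results_for_run(run_id):
    # Cheap: a single partition lookup, like a Cassandra SELECT by key.
    return results_by_run.get(run_id, [])

def results_for_host(host):
    # A query we did NOT plan for: there is no table keyed by host,
    # so the only option is a full scan over every partition.
    return [r for rows in results_by_run.values()
            for r in rows if r["host"] == host]
```

If "results by host" turns out to matter later, the fix in Cassandra is to write the data a second time into a table keyed by host, which is exactly the kind of decision you'd want to make up front.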

Other options I'm less familiar with: HBase (data storage based on HDFS) and Pig (like Hive, queries are compiled into Hadoop jobs; what changes is the model: instead of writing SQL queries, you describe dataflows, or "flows").
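Roughly, a Pig script describes a sequence of transformations rather than a single declarative query. The same shape expressed in plain Python looks like this (record fields made up for illustration; in Pig these steps would be LOAD, FILTER, GROUP, and an aggregate):

```python
# A Pig-style "flow": each stage consumes the previous stage's records.
records = [                                       # LOAD
    {"test": "ping", "status": "PASS"},
    {"test": "ping", "status": "FAIL"},
    {"test": "dns",  "status": "PASS"},
]

passed = (r for r in records if r["status"] == "PASS")   # FILTER

counts = {}                                       # GROUP + COUNT
for r in passed:
    counts[r["test"]] = counts.get(r["test"], 0) + 1

# counts == {"ping": 1, "dns": 1}
```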

I suggest trying Hive (or Pig), perhaaps through a service like Amazon EMR (so that you can avoid the hassle of setting up a Hadoop cluster yourself).

Upvotes: 1
