whatupwilly

Reputation: 5177

Database suggestion for processing/reporting on large amount of log file type data

We have an app that creates text log files of requests to it. Pretty typical stuff in the log file and it is space delimited (date, time, url, http code, ip, user agent, etc).

Currently, we are generating around 500k entries in the text log files per day.

We're currently doing a lot of analysis via sed/awk/grep of the text files. However, that isn't really going to scale, especially as we want to start reporting across multiple days:

e.g.

  • How many times did this IP address hit this URL in the last 5 days?
  • What % of requests resulted in 500s for specific URLs?

It's easy enough to do regular imports into a MySQL db and pull this type of data with selects/group-bys. However, even with a few hundred thousand rows, the queries are relatively slow.
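For concreteness, here's roughly the kind of schema and queries I mean (table name, column names and indexes below are simplified placeholders, not our exact setup):

    -- Simplified schema; real column list follows the space-delimited log format
    CREATE TABLE access_log (
        log_date   DATE          NOT NULL,
        log_time   TIME          NOT NULL,
        url        VARCHAR(2048) NOT NULL,
        http_code  SMALLINT      NOT NULL,
        ip         VARCHAR(45)   NOT NULL,
        user_agent VARCHAR(512),
        KEY idx_ip_url_date (ip, url(255), log_date),
        KEY idx_url_date (url(255), log_date)
    );

    -- How many times did this IP hit this URL in the last 5 days?
    SELECT COUNT(*)
    FROM access_log
    WHERE ip = '203.0.113.7'
      AND url = '/some/path'
      AND log_date >= CURDATE() - INTERVAL 5 DAY;

    -- What % of requests resulted in 500s for a specific URL?
    SELECT 100.0 * SUM(http_code >= 500) / COUNT(*) AS pct_500
    FROM access_log
    WHERE url = '/some/path';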

I'm a n00b when it comes to some of the new NoSQL dbs out there (Cassandra, Dynamo, BigTable) but would any of them be well suited for this? I'm continuing to read up on them, but maybe this crew has some recommendations.

Thanks!

Upvotes: 3

Views: 753

Answers (3)

valyala

Reputation: 17850

If you know in advance all the fields in each log line, then the best solution would be to create a table in ClickHouse with a column per log field, and then ingest the logs into this table. You can then query the ingested data in real time at very high speed. ClickHouse is able to scan tens of billions of rows per second in a single-node setup. ClickHouse also compresses typical logs well, so they occupy much less disk space than the original log files. Uber, Cloudflare and Sentry successfully use ClickHouse for storing and analyzing petabytes of logs. See this, this and this for details.
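As a rough sketch (the table and column names below are just placeholders for the fields in your log format):

    -- One column per log field; names are placeholders
    CREATE TABLE access_log
    (
        log_time   DateTime,
        url        String,
        http_code  UInt16,
        ip         String,
        user_agent String
    )
    ENGINE = MergeTree
    ORDER BY (url, ip, log_time);

    -- Hits from a given IP on a given URL over the last 5 days
    SELECT count()
    FROM access_log
    WHERE ip = '203.0.113.7'
      AND url = '/some/path'
      AND log_time >= now() - INTERVAL 5 DAY;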

P.S. you can also try the log database I work on - VictoriaLogs. It is built on the same architecture ideas as ClickHouse to achieve high performance and low resource usage, but it is optimized solely for painless work with logs. For example, its query language - LogsQL - is easier to use than SQL for typical log debugging, full-text search and analysis.

Upvotes: 0

LesterDove

Reputation: 3044

There are a couple of reasons why I wouldn't necessarily look right away to a NoSQL solution:

  • Yours is a known schema, which sounds like it won't be changing.

  • There doesn't seem to be a lot of denormalizing potential for you, as you've pretty much got a single flat table structure.

  • You haven't made any reference to application scalability (# of users), just the size of the query.

And those are three of the big 'wins' for NoSQL as I know it.

That being said, I'm no expert, and I don't know for sure that it wouldn't make for faster reads, so it's definitely worth a try!

Upvotes: 0

srkiNZ84

Reputation: 3568

We've had a similar problem at work and managed to solve it by dumping the data into a column-oriented database. These kinds of databases are much better at the analytical queries you're describing. There are several options:

http://en.wikipedia.org/wiki/Column-oriented_DBMS

We've had good experience with InfiniDB:

http://infinidb.org/

Using this approach we managed to speed up the queries by approx. 10x; however, it's not a silver bullet and eventually you'll run into the same problems again.

You might also want to look at partitioning your data to improve performance.
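For example, a sketch of date-based range partitioning in MySQL (table, column and partition names are placeholders; adjust the boundaries to your retention scheme):

    -- Range partitioning by day; names and dates are placeholders
    CREATE TABLE access_log (
        log_date  DATE          NOT NULL,
        url       VARCHAR(2048) NOT NULL,
        http_code SMALLINT      NOT NULL,
        ip        VARCHAR(45)   NOT NULL
    )
    PARTITION BY RANGE (TO_DAYS(log_date)) (
        PARTITION p_day1 VALUES LESS THAN (TO_DAYS('2010-06-02')),
        PARTITION p_day2 VALUES LESS THAN (TO_DAYS('2010-06-03')),
        PARTITION p_max  VALUES LESS THAN MAXVALUE
    );

Queries that filter on a date range then only scan the relevant partitions, and old days can be dropped cheaply with DROP PARTITION instead of a big DELETE.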

Upvotes: 2
