Reputation: 477
I'd like a more efficient way for my Perl script to parse through syslogs.
My script runs from cron every hour to output some statistics. I noticed it takes 5-10 minutes to complete as the day progresses (syslogs are archived daily) because the syslog files grow to several GB in size and the script does a simple:
open my $log, '<', $logfile or die "Fatal error: could not open $logfile: $!";
The problem is that early in the day the current hour's worth of logs are the first lines in the file, but as the day progresses the "current hour" of log entries sits further and further in, say lines 600000 to 700000. Since the script reads from the top every time, each hour it gets slower and slower.
A clumsy approach would be to grep the file for the current hour, store the results in a tmp file, have my Perl script process the tmp file, remove the tmp file, and repeat.
Is there a more programmatic way to ensure I'm not re-reading thousands of lines every time?
SK
Upvotes: 0
Views: 869
Reputation: 165110
You have a bunch of possible solutions.
First is to implement hourly, instead of daily, log rotation. Then your program only has to read the hourly log file. This is probably a good idea in general if your logs are getting into the gigabyte range per day.
If that's not possible, there's probably work which can be done to improve the performance of your search code. The first step would be to run a code profiler like Devel::NYTProf to find out where your program is spending its time.
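For example (parse_syslog.pl is a stand-in for your script's name):
perl -d:NYTProf parse_syslog.pl    # run once under the profiler
nytprofhtml                        # turn nytprof.out into an HTML report in ./nytprof/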
Instead of doing a linear search you can do a binary search. Assuming your logfile entries are something like this:
Mar 22 01:22:34 blah blah blah
Mar 22 01:22:35 blah blah blah
seek to the halfway point of the file, read a partial line, throw it out, and read the next full line. Check its timestamp. If it's too new, seek backwards half the remaining space; if it's too old, seek forward half the remaining space. Repeat until you find the start of the hour.
For a billion records this will take about log2(2^30) = 30 steps.
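Here's a sketch of that search, assuming every line begins with a timestamp in the format above. The target is a string prefix such as "Mar 22 14"; plain string comparison is safe within a single day's file because the date part is constant and the time fields are fixed-width.

use strict;
use warnings;

# Binary-search a chronologically ordered log for the first line of an hour.
# $target is a timestamp prefix such as "Mar 22 14".
sub seek_to_hour {
    my ($fh, $target) = @_;
    my ($lo, $hi) = (0, -s $fh);

    while ($hi - $lo > 1) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid, 0;
        <$fh>;                       # discard the partial line we landed in
        my $line = <$fh>;
        if (!defined $line or substr($line, 0, length $target) ge $target) {
            $hi = $mid;              # too new (or past EOF): search earlier
        }
        else {
            $lo = $mid;              # too old: search later
        }
    }

    # Within a line or two of the boundary now; scan forward to it.
    seek $fh, $lo, 0;
    <$fh> if $lo > 0;                # skip the partial line
    while (my $line = <$fh>) {
        if (substr($line, 0, length $target) ge $target) {
            seek $fh, tell($fh) - length $line, 0;   # rewind to line start
            return 1;
        }
    }
    return 0;                        # hour not in this file
}

open my $fh, '<', '/var/log/syslog' or die "Can't open syslog: $!";
if (seek_to_hour($fh, 'Mar 22 14')) {
    while (my $line = <$fh>) {
        last unless substr($line, 0, 9) eq 'Mar 22 14';
        # ... gather statistics from $line ...
    }
}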
Another option is to read the file backwards. Start at the end (the newest log entry) and work back until you hit the start of the hour. File::ReadBackwards can do this fairly efficiently.
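A sketch, under the same timestamp assumptions; File::ReadBackwards hands you lines newest-first:

use strict;
use warnings;
use File::ReadBackwards;

my $target = 'Mar 22 23';            # hypothetical: the hour being reported on
my $bw = File::ReadBackwards->new('/var/log/syslog')
    or die "Can't open syslog: $!";

# Walk backwards from the newest entry, stopping once the
# timestamps get older than the hour we care about.
my @hour;
while (defined(my $line = $bw->readline)) {
    my $stamp = substr $line, 0, length $target;
    last if $stamp lt $target;       # left the hour: stop reading
    push @hour, $line if $stamp eq $target;
}
@hour = reverse @hour;               # restore chronological order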
You could change your log statistics program to write its results to a database, including the position in the log file of the last record it processed. Then the next time it runs it seeks to that position, verifies it's correct, and reads forward from there.
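A sketch of that checkpointing, using a flat state file rather than a database to keep it short; the state-file path and the verify-by-rereading-the-last-line scheme are assumptions:

use strict;
use warnings;

my $logfile   = '/var/log/syslog';
my $statefile = '/var/tmp/syslog_stats.state';   # hypothetical path

open my $log, '<', $logfile or die "Can't open $logfile: $!";

my ($pos, $last) = (0, '');

# Try to resume from where the last run left off.
if (open my $state, '<', $statefile) {
    chomp(my $rec = <$state> // '');
    close $state;
    my ($p, $line) = split /\t/, $rec, 2;
    if (defined $line && $p =~ /^\d+$/ && $p > 0 && $p <= -s $log) {
        # Re-read the saved line to verify the file wasn't rotated.
        seek $log, $p - length($line) - 1, 0;    # back up over line + newline
        chomp(my $check = <$log> // '');
        if ($check eq $line) {
            ($pos, $last) = ($p, $line);         # resume from here
        }
        else {
            seek $log, 0, 0;                     # rotated: start over
        }
    }
}

while (my $line = <$log>) {
    # ... update statistics for $line here ...
    $pos = tell $log;
    chomp($last = $line);
}

# Save the checkpoint for the next run.
open my $state, '>', $statefile or die "Can't write $statefile: $!";
print {$state} "$pos\t$last\n";
close $state;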
Finally, consider using a database. You can have syslogd itself log to a database, which avoids the overhead of having every program log to the database individually. rsyslog and syslog-ng, for example, can do this.
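With rsyslog's ommysql output and its stock schema (a SystemEvents table with a ReceivedAt column), the hourly job shrinks to one indexed query; the connection details below are placeholders:

use strict;
use warnings;
use DBI;

# Credentials and host are placeholders; adjust for your setup.
my $dbh = DBI->connect('dbi:mysql:database=Syslog;host=localhost',
                       'statsuser', 'secret', { RaiseError => 1 });

# Pull only the last hour's messages, letting the database do the search.
my $sth = $dbh->prepare(q{
    SELECT Message
      FROM SystemEvents
     WHERE ReceivedAt >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
});
$sth->execute;
while (my ($msg) = $sth->fetchrow_array) {
    # ... update statistics from $msg ...
}
$dbh->disconnect;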
Upvotes: 7