Igor

Reputation: 1

Parsing Apache logs efficiently in PHP

Ok, this is the scenario: I need to parse my logs to find out how many times image thumbnails have been downloaded without the "large image" page actually being viewed... This is basically a hotlink protection system based on the ratio of "thumb" to "full" image views.

Considering the server is bombarded constantly by requests for the thumbnails, the most efficient solution seems to be buffered Apache logs that write to disk once every, say, 1 MB, and then parsing the logs periodically.
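(For what it's worth, mod_log_config has a BufferedLogs directive for exactly this kind of buffering, though as far as I know the flush threshold isn't directly configurable:)

# httpd.conf, server config only
BufferedLogs On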

My question is this: how do I parse an Apache log in PHP to save the data, given the following:

The PHP logger script would be called once every 60 seconds and would process however many log lines it can during that time.

I've tried hacking some code together, but I'm having problems keeping memory usage to a minimum and finding a way to track the file pointer while the file size keeps growing.

Here's a part of the log:

212.180.168.244 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441268.jpg HTTP/1.1" 200 3072 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
122.53.168.123 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441276.jpg HTTP/1.1" 200 3007 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
143.22.203.211 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441282.jpg HTTP/1.1" 200 4670 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"

Attaching the code for your review here:

<?php
//limit for running it every minute
error_reporting(E_ALL);
ini_set('display_errors',1);
set_time_limit(0);
include(dirname(__FILE__).'/../kframework/kcore.class.php');
$aj = new kajaxpage;
$aj->use_db=1;
$aj->init();
$db=kdbhandler::getInstance();
$d=kdebug::getInstance();
$d->debug=TRUE;
$d->verbose=TRUE;

$log_file = "/var/log/nginx/access.log"; //full path to log file when run by cron
$pid_file = dirname(__FILE__)."/../kframework/cron/cron_log.pid";
//$images_id = array("8308086", "7485151", "6666231", "8343336");

if (file_exists($pid_file)) {
    $pid = file_get_contents($pid_file); // holds "<timestamp> <byte offset>" from the previous run
    $temp = explode(" ", $pid);
    $pid_timestamp = $temp[0];
    $now_timestamp = time();
    //if (($now_timestamp - $pid_timestamp) < 90) return;
    $pointer = $temp[1];
    if ($pointer > filesize($log_file)) $pointer = 0; // log was rotated or truncated: start over
}
else $pointer = 0;

$pattern = "/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[^\[]*\[([^\]]*)\][^\"]*\"([^\"]*)\"\s([0-9]*)\s([0-9]*)(.*)/";
$last_time = 0;
$lines_processed=0;

if ($fp = fopen($log_file, "r")) { // read-only: we never write to the log itself
    fseek($fp, $pointer);
    while (!feof($fp)) {
        //if ($lines_processed>100) exit;
        $lines_processed++;
        $log_line = trim(fgets($fp));
        // skip blank lines and anything that doesn't match the log pattern
        if (!empty($log_line) && preg_match($pattern, $log_line, $matches)) {
            $type = null; // reset per line so a non-matching URI can't reuse the previous line's type
            $imgid = 0;
            $size = $matches[5];
            $uri = trim(str_replace(array("GET ", "HTTP/1.1"), "", $matches[3]));
            $uri = str_replace(".jpg/", ".jpg", $uri);
            if (substr($uri, 0, 3) == "/t/") {
                $parts = explode("/", $uri); // end() needs a variable, not a function result
                $get = explode("-", end($parts));
                $imgid = $get[0];
                $type = 'thumb';
            }
            elseif (substr($uri, 0, 5) == "/img/") {
                $get1 = explode("/", $uri);
                $get2 = explode("-", $get1[2]);
                $imgid = $get2[0];
                $type = 'raw';
            }
            echo $uri;
            // put your SQL insert or update here
            $imgid = (int) $imgid;
            if ($type !== null && $imgid != 1) {
                switch ($type) {
                    case 'thumb':
                        //use the second slave in the registry
                        $sql=$db->slave_query("INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1 ",2);
                        echo "INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1";
                    break;
                    case 'raw':
                        //use the second slave in the registry
                        $sql=$db->slave_query("INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1",2);
                        echo "INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1";
                    break;
                }
            }

            // $imgid - image ID
            // $size - image size

            // checkpoint the current offset at most every 30 seconds
            $timestamp = time();
            if (($timestamp - $last_time) > 30) {
                file_put_contents($pid_file, $timestamp . " " . ftell($fp));
                $last_time = $timestamp;
            }
        }
    }
    file_put_contents($pid_file, (time() - 95) . " " . ftell($fp));
    fclose($fp);
}

?>

Upvotes: 0

Views: 3663

Answers (4)

Monkey Code

Reputation: 602

I know this answer is late, but it could still help (code can always be improved).

The 10 GB file size and the memory required sound like your main problems. Apache supports multiple log files, and the real power of multiple log files comes from the ability to create them in different formats: http://httpd.apache.org/docs/1.3/multilogs.html

Create a second log file with only the minimal data you need for your real-time log monitoring. In this case you could keep the user-agent string etc. out of the log in the first place.

Based on your example log lines, this could roughly halve the amount of data PHP has to load.
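For example, something along these lines in httpd.conf (the "minimal" format string is only a guess at the fields you'd actually need: host, time, request, status, bytes):

LogFormat "%h %t \"%r\" %>s %b" minimal
CustomLog logs/access_log combined
CustomLog logs/access_minimal_log minimal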

Upvotes: 1

Svish

Reputation: 158331

Maybe you can tweak my PHP version of tail to search for your last timestamp rather than counting lines, and then read lines from that point, dealing with them one by one?

Would give it a try myself as I'm a bit curious, but unfortunately unable to do so right now :(
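The core of the idea could look something like this (an untested sketch; the function name, the chunk size, and the idea of saving the bracketed timestamp from the previous run are my own assumptions):

<?php
// Sketch: scan backwards from the end of the log in fixed-size chunks until
// the timestamp saved by the previous run is found, then return the offset
// of the line after it. $lastTimestamp is assumed to look like
// "18/Jan/2012:20:06:57 +0100".
function seekPastTimestamp($fp, $lastTimestamp, $chunkSize = 8192)
{
    fseek($fp, 0, SEEK_END);
    $pos = ftell($fp);
    $buffer = '';
    while ($pos > 0) {
        $readFrom = max(0, $pos - $chunkSize);
        fseek($fp, $readFrom);
        $buffer = fread($fp, $pos - $readFrom) . $buffer; // prepend the new chunk
        $pos = $readFrom;
        $found = strrpos($buffer, $lastTimestamp);
        if ($found !== false) {
            $eol = strpos($buffer, "\n", $found); // end of the matched line
            return $pos + ($eol === false ? strlen($buffer) : $eol + 1);
        }
    }
    return 0; // not found (rotated log?): start from the beginning
}

$fp = fopen("/var/log/nginx/access.log", "r");
fseek($fp, seekPastTimestamp($fp, "18/Jan/2012:20:06:57 +0100"));
while (($line = fgets($fp)) !== false) {
    // deal with one line at a time here
}
fclose($fp);
?>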

Upvotes: 1

Evert

Reputation: 99831

I'd personally send the log entries to a running script instead. Apache allows this by starting the filename for the log with a pipe (|). If that doesn't work, you can create a FIFO as well (see mkfifo).

The running script (whatever it is) can buffer x lines and do what it needs to do based on those. Reading the data isn't all that hard, and shouldn't be where your bottleneck is.
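A minimal sketch of that setup, with made-up paths and an arbitrary batch size of 100. In httpd.conf:

CustomLog "|/usr/bin/php /path/to/log_consumer.php" combined

And log_consumer.php, which Apache keeps running for the life of the server, feeding it one log line per request on STDIN:

<?php
// log_consumer.php - sketch of a piped-log consumer: buffer lines, flush in batches
$batch = array();
$batchSize = 100; // arbitrary; tune to your traffic

while (($line = fgets(STDIN)) !== false) {
    $batch[] = trim($line);
    if (count($batch) >= $batchSize) {
        flushBatch($batch);
        $batch = array();
    }
}
if ($batch) flushBatch($batch); // Apache closed the pipe: flush the remainder

function flushBatch(array $lines)
{
    // parse each line and run one multi-row INSERT per batch
    // instead of one query per request
}
?>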

I do suspect that you will run into issues with your INSERT statements on the database.

Upvotes: 0

Cybercartel

Reputation: 12592

One solution would be to store the log in a MySQL database. You could write a small C program to parse the log file after loading it into MySQL; that would be an order of magnitude faster, and it's not very difficult. Another option would be Python, but I think a database is necessary either way. You can use a full-text index to match your strings, and Python can also be compiled to a binary, which makes it more efficient. To address the comments: the log file grows incrementally, so it's not as if you have to handle 10 GB at once.
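A rough sketch of the bulk-load idea (connection details, table, and path are made up; LOAD DATA LOCAL INFILE has to be enabled on both client and server, and FULLTEXT indexes required MyISAM in the MySQL of this era):

<?php
// Sketch: bulk-load raw log lines into MySQL, then count with SQL
// instead of scanning the file in PHP.
$db = new mysqli("localhost", "user", "pass", "logs");

$db->query("CREATE TABLE IF NOT EXISTS access_log (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    line TEXT NOT NULL,
    FULLTEXT KEY line_ft (line)
) ENGINE=MyISAM");

$db->query("LOAD DATA LOCAL INFILE '/var/log/nginx/access.log'
            INTO TABLE access_log (line)");

// e.g. count thumbnail hits entirely in SQL
$res = $db->query("SELECT COUNT(*) FROM access_log WHERE line LIKE '%GET /t/%'");
?>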

Upvotes: 0
