Dan

Reputation: 3377

Atomically file_get_contents + file_put_contents

I've got a piece of code that truncates a CSV log file to a specified trailing period of time. The first field in each CSV entry is a timestamp.

The below correctly purges, but it does not truncate the log atomically, so log entries appended between the file_get_contents and file_put_contents calls can be lost. Since new entries go at the bottom of the file, there is no risk of corrupting the log prior to that point.

I considered manually performing the operations underlying file_get_contents and file_put_contents, but the PHP docs claim that these functions do all kinds of super fun voodoo optimizations and are the recommended way of doing what I want (reading an entire file into a string and writing a string out to a file), so I was curious whether there is a way to use these functions without being unsafe.

$time = time();
$fp = @fopen( $file, 'r' );
if ( $fp !== false ) {
    $truncate = false;
    $offset   = 0;

    // find the first non-expired entry
    while ( ( $fields = fgetcsv( $fp ) ) !== false ) {
        if ( ! is_null( $fields ) && $time <= ( $fields[0] + $purge_interval ) ) {
            // we've reached the recent entries -- nothing beyond here will be removed
            break;
        }

        $offset   = @ftell( $fp );
        if ( false === $offset ) {
            break;
        }

        $truncate = true;
    }

    @fclose( $fp );

    if ( $truncate ) {
        // the next two lines need to be performed atomically...
        $data = file_get_contents( $file, false, null, $offset );
        file_put_contents( $file, $data, LOCK_EX );
    }
}

Upvotes: 1

Views: 2562

Answers (2)

Sven

Reputation: 70893

I think log files follow the append-only pattern for a reason: it is hard to make them performant and editable at the same time. That's why log files are usually rotated atomically in the filesystem by a cron job, which allows cutting off the old part, possibly compacting or eventually deleting it, while new data is stored in a fresh file.

So I'd try to separate the creation of log entries from processing them by having separate files. Create a new log file every day, or every hour. Deal with the old files after a new file has started.
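
A minimal sketch of that approach (the directory, file-name pattern, and $message variable are illustrative; $purge_interval is the retention window from the question): writers only ever append to the current day's file, and a cleanup pass deletes whole expired files rather than editing any file in place.

$dir = '/var/log/myapp'; // hypothetical log directory

// writer: append to today's file -- appends never disturb older entries
$line = time() . ',' . $message . "\n";
file_put_contents( $dir . '/app-' . gmdate( 'Y-m-d' ) . '.csv', $line, FILE_APPEND | LOCK_EX );

// cleanup (e.g. from cron): drop whole files older than the retention window
foreach ( (array) glob( $dir . '/app-*.csv' ) as $old ) {
    if ( filemtime( $old ) < time() - $purge_interval ) {
        @unlink( $old );
    }
}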

Upvotes: 1

Jon

Reputation: 437554

There is no bulletproof way of doing in-place concurrent modifications like that. The process will have to drop one of those two attributes (in-place or fully concurrent) in order to be implementable.

Since you also control the log writers, a simple and good solution is to drop the absolute concurrency and synchronize access to the log with flock. The log writers would periodically open the log to append to it, and both they and the truncation process would lock the log file for the duration of their operations.

For example, the truncation utility would do:

$fp = fopen( $file, 'r+' ); // handle used only for locking
if ( $fp !== false && flock( $fp, LOCK_EX ) ) {
    // everyone honoring the lock is now waiting, so read + rewrite is safe
    $data = file_get_contents( $file, false, null, $offset );
    // no LOCK_EX here: flock on a second handle to the same file can
    // conflict with the lock this process already holds
    file_put_contents( $file, $data );
    flock( $fp, LOCK_UN );
    fclose( $fp );
}

The log writers would also acquire the lock before writing to the file. One point of interest is that the writers might prefer to try a non-blocking lock and, if the lock is busy, keep storing log entries in memory so as not to block the process for an unknown amount of time; the write would then be attempted again periodically.
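
A rough sketch of such a writer, assuming the process keeps a persistent $buffer array between attempts (the variable names are illustrative):

$buffer[] = $entry; // always queue first; flush only when the lock is free

$fp = @fopen( $file, 'a' );
if ( $fp !== false ) {
    if ( flock( $fp, LOCK_EX | LOCK_NB ) ) {
        // lock acquired without waiting -- write out everything queued so far
        fwrite( $fp, implode( '', $buffer ) );
        fflush( $fp );
        flock( $fp, LOCK_UN );
        $buffer = array();
    }
    fclose( $fp );
}
// if the lock was busy, $buffer keeps the entries until the next attempt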

Upvotes: 2
