Reputation: 509
We have an nginx/php-fpm setup on EC2 that receives file chunks to an NFS-mounted "chunk" folder (SoftNAS specifically) that is shared among multiple app servers. We have a problem where the app checks for the existence of the file before uploading the finished file to S3, but the file check is failing even though the file is there.
The app calls clearstatcache() prior to the is_file() or file_exists() check (we've tried both), but the file does not become visible to the app for 10-20s.
This is the output of some runs of that test:
app1 write timestamp 1484702190.5575
app2 read timestamp 1484702216.0643
25.5068 seconds
app1 write timestamp 1484702229.0130
app2 read timestamp 1484702246.0652
17.0522 seconds
app1 write timestamp 1484702265.6277
app2 read timestamp 1484702276.0646
10.4369 seconds
app1 write timestamp 1484702286.0136
app2 read timestamp 1484702306.0645
20.0509 seconds
app1 write timestamp 1484702314.4844
app2 read timestamp 1484702336.0648
21.5804 seconds
app1 write timestamp 1484702344.3694
app2 read timestamp 1484702366.0644
21.6950 seconds
app1 write timestamp 1484702374.0460
app2 read timestamp 1484702396.0645
22.0185 seconds
app1 write timestamp 1484702404.0346
app2 read timestamp 1484702426.0647
22.0301 seconds
app1 write timestamp 1484702434.2560
app2 read timestamp 1484702456.1092
21.8532 seconds
app1 write timestamp 1484702466.0083
app2 read timestamp 1484702486.1085
20.1002 seconds
app1 write timestamp 1484702496.5466
app2 read timestamp 1484702516.1088
19.5622 seconds
app1 write timestamp 1484702525.2703
app2 read timestamp 1484702546.1089
20.8386 seconds
app1 write timestamp 1484702558.3312
app2 read timestamp 1484702576.1092
17.7780 seconds
We've tried a number of variations on checking the file, but none of them seems to make any difference. We didn't run extensive tests on each option, but each one took excessively long before the file-existence check passed.
There's one thing that did work: while running a loop of "ls" on app2 in a shell, the file is instantly visible to the app2 script.
app1 write timestamp 1484703581.3749
app2 read timestamp 1484703581.3841
0.0092 seconds
app1 write timestamp 1484703638.8100
app2 read timestamp 1484703638.8139
0.0039 seconds
app1 write timestamp 1484703680.8548
app2 read timestamp 1484703680.8576
0.0028 seconds
So, something in the shell is correctly forcing the NFS cache to refresh, but clearstatcache() in PHP doesn't appear to make any difference.
(Edit) the code in question:
public static function get($filepath) {
    clearstatcache(TRUE, $filepath);
    if (file_exists($filepath)) {
        $instance = new static::$_class;
        $instance->init($filepath);
        return $instance;
    } else {
        // Sometimes a new file is not found with the first is_file() attempt.
        // Clear the stat cache and try to find the file again.
        clearstatcache(TRUE, $filepath);
        if (file_exists($filepath)) {
            $instance = new static::$_class;
            $instance->init($filepath);
            return $instance;
        }
    }
    Log::error("AJRFSFILE " . $_SERVER['PATH_INFO'] . " " . $_SERVER['HTTP_DEVICE'] . " " . $filepath . " " . json_encode(stat($filepath)));
    return false;
}
(Edit2) It turns out that running exec() with an "ls" in the code successfully clears whatever file-level caching is taking place at the system level, but for obvious reasons calling exec() on every file_exists() is a sub-optimal solution.
Upvotes: 3
Views: 2540
Reputation: 39434
Here's what's going on. The PHP stat cache relies on attribute information (atime and friends) supplied by the underlying VFS. When NFS powers the VFS, those attributes are subject to client-side caching to reduce server round-trips. Unfortunately, that caching can cause PHP to "lie" about the file's state, because in reality the NFS server hasn't yet given the VFS current information.
You can force immediate coherence with the noac mount option. I recommend using this on any server where you absolutely, positively need the latest information in the shortest possible time:
Use the noac mount option to achieve attribute cache coherence among multiple clients. Almost every file system operation checks file attribute information. The client keeps this information cached for a period of time to reduce network and server load. When noac is in effect, a client’s file attribute cache is disabled, so each operation that needs to check a file’s attributes is forced to go back to the server. This permits a client to see changes to a file very quickly, at the cost of many extra network operations.
If noac is too slow, there are other mount options that might better tune the cache for your needs; see lookupcache and actimeo. For example, decreasing actimeo lowers how long the NFS client caches attribute information locally (the default ranges from 30 seconds minimum to 60 seconds maximum). Or, as another example, lookupcache=positive will provide faster intelligence on the appearance of new files, but will cache their existence long after they are unlinked.
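For reference, these options go in the mount entry. A sketch of what the /etc/fstab lines might look like (the server name and export/mount paths here are placeholders, not from the question; actimeo is in seconds):

```
# Disable attribute caching entirely (strong coherence, more NFS round-trips):
nfs-server:/export/chunks  /mnt/chunks  nfs  noac  0 0

# Or keep caching but shorten it to 1 second and skip negative-lookup caching:
nfs-server:/export/chunks  /mnt/chunks  nfs  actimeo=1,lookupcache=positive  0 0
```

Remounting (or mount -o remount) is needed for the new options to take effect.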
But why, without these mount options, does an ls in the directory "fix" the issue? It turns out that an opendir()/closedir() sequence invalidates the NFS attribute cache, which forces a call back to the server.
So in your case you can use an opendir()/closedir() sequence to invalidate the cache. I'm not sure whether system("ls") will work, as I believe each process has a different view of the underlying attribute cache, but it's worth a try.
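A minimal sketch of that workaround in PHP, replacing the exec("ls") hack with the directory-handle dance (the function name is mine, not from the original code):

```php
<?php
// Check for a file on an NFS mount, first forcing the NFS client to
// revalidate its attribute cache for the file's parent directory.
// opendir()/closedir() on the directory triggers that revalidation.
function nfs_file_exists($filepath) {
    $dir = opendir(dirname($filepath));
    if ($dir !== false) {
        closedir($dir);
    }
    clearstatcache(true, $filepath); // also drop PHP's own stat cache
    return file_exists($filepath);
}
```

The clearstatcache() call is still needed: opendir()/closedir() refreshes the kernel-side NFS cache, while clearstatcache() drops PHP's userland copy.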
Upvotes: 3