Jesse

Reputation: 231

PHP: Writing a lot of small files the fastest or/and most efficient way

Imagine that a campaign has 10,000 to 30,000 files, about 4 KB each, that should be written to disk.

And, there will be a couple of campaigns running at the same time. 10 tops.

Currently, I'm going with the usual way: file_put_contents.

It gets the job done, but slowly, and its PHP process sits at 100% CPU usage the whole time.

fopen, fwrite, fclose: the result is similar to file_put_contents.

I've tried some async I/O extensions such as PHP eio and Swoole.

That's faster, but it yields "too many open files" after some time.

php -r 'echo exec("ulimit -n");' prints 800000.

Any help would be appreciated!


Well, this is sort of embarrassing... you guys are correct, the bottleneck is how the file content is generated...

Upvotes: 9

Views: 6093

Answers (3)

LSerni

Reputation: 57428

I am assuming that you cannot follow SomeDude's very good advice on using databases instead, and you already have performed what hardware tuning could be performed (e.g. increasing cache, increasing RAM to avoid swap thrashing, purchasing SSD drives).

I'd try and offload the file generation to a different process.

You could e.g. install Redis and store the file content in the key-value store, which is very fast. Then, a different, parallel process could extract the data from the store, delete it, and write it to a disk file.

This removes all disk I/O from the main PHP process, and lets you monitor the backlog (how many keys are still unflushed: ideally zero) and concentrate on the bottleneck in content generation. You'll possibly need some extra RAM.

On the other hand, this is not too different from writing to a RAM disk. You could also output data to a RAM disk, and it would probably be even faster:

# As root
mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
mkdir /mnt/ramdisk/temp 
mkdir /mnt/ramdisk/ready
# Change ownership and permissions as appropriate

and in PHP:

$fp = fopen("/mnt/ramdisk/temp/{$file}", "w");
fwrite($fp, $data);
fclose($fp);
rename("/mnt/ramdisk/temp/{$file}", "/mnt/ramdisk/ready/{$file}");

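and then have a different process (crontab? Or a continuously running daemon?) move files from the "ready" directory of the RAM disk to the disk, deleting each file from the RAM disk once it has been copied. Because rename() within one file system is atomic, that process never sees a half-written file in "ready".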
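
The mover process could look roughly like the sketch below. It is only an illustration: flush_ready() is a hypothetical helper name, and the demo uses temporary directories as stand-ins for /mnt/ramdisk/ready and the real target directory.

```php
<?php
// Hypothetical helper: move every finished file from the RAM disk's
// "ready" directory to its permanent location on disk.
function flush_ready(string $readyDir, string $destDir): int
{
    $moved = 0;
    foreach (scandir($readyDir) ?: [] as $name) {
        $src = $readyDir . '/' . $name;
        if (!is_file($src)) {
            continue; // skips "." and ".." as well as subdirectories
        }
        // copy() + unlink() rather than rename(): source and destination
        // are assumed to live on different file systems (tmpfs vs. disk).
        if (copy($src, $destDir . '/' . $name) && unlink($src)) {
            $moved++;
        }
    }
    return $moved;
}

// Demo: temporary directories stand in for the RAM disk and the disk.
$ready = sys_get_temp_dir() . '/ready-' . uniqid();
$dest  = sys_get_temp_dir() . '/dest-' . uniqid();
mkdir($ready);
mkdir($dest);
file_put_contents($ready . '/a.txt', 'alpha');
file_put_contents($ready . '/b.txt', 'beta');

$moved = flush_ready($ready, $dest);
print "Moved {$moved} files\n";
```

A daemon would call flush_ready() in a loop with a short sleep between passes; a cron job would call it once per run.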

File System

The time required to create a file depends on the number of files already in the directory, in a way that itself depends on the file system. ext4, ext3, zfs, btrfs etc. will exhibit different behaviour. Specifically, you might experience significant slowdowns once the number of files exceeds some threshold.

So you might want to try timing the creation of a large number of sample files in one directory, and see how the time grows as the number grows. Keep in mind that there is also a performance penalty for accessing different directories, so jumping straight to a very large number of subdirectories is not recommended either.

<?php
    $payload    = str_repeat("Squeamish ossifrage. \n", 253);
    $time       = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        $fp = fopen("file-{$i}.txt", "w");
        fwrite($fp, $payload);
        fclose($fp);
    }
    $time = microtime(true) - $time;
    for ($i = 0; $i < 10000; $i++) {
        unlink("file-{$i}.txt");
    }
    print "Elapsed time: {$time} s\n";

Creation of 10000 files takes 0.42 seconds on my system, but creation of 100000 files (10x) takes 5.9 seconds, not 4.2. On the other hand, creating one eighth of those files in 8 separate directories (the best compromise I found) takes 6.1 seconds, so it's not worthwhile.

But suppose that creating 300000 files took 25 seconds instead of 17.7; dividing those files in ten directories might take 22 seconds, and make the directory split worthwhile.
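
If a directory split does turn out to pay off on your system, the files can be spread over a fixed number of subdirectories by hashing the file name. This is a sketch only: shard_path() is a hypothetical helper, and the default of 8 buckets is simply the value that worked best in the test above, not a recommendation.

```php
<?php
// Hypothetical helper: map a file name to one of $shards subdirectories.
function shard_path(string $baseDir, string $name, int $shards = 8): string
{
    // The first 4 hex digits of md5() give a number in 0..65535, which
    // spreads names roughly evenly across the buckets.
    $bucket = hexdec(substr(md5($name), 0, 4)) % $shards;
    return sprintf('%s/%02d/%s', $baseDir, $bucket, $name);
}

$base    = sys_get_temp_dir() . '/shards-' . uniqid();
$payload = str_repeat("Squeamish ossifrage. \n", 253);

$time = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $path = shard_path($base, "file-{$i}.txt");
    $dir  = dirname($path);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // create the bucket on first use
    }
    file_put_contents($path, $payload);
}
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";
```

Since the bucket is derived from the name alone, a reader can find a file again by calling shard_path() with the same arguments.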

Parallel processing: r strategy

TL;DR this doesn't work so well on my system, though your mileage may vary. If the operations to be done are lengthy (here they are not) and bound differently from the main process (e.g. I/O bound versus CPU bound), then it can be advantageous to offload each of them to a separate child process, provided you don't spawn too many.

You will need pcntl functions installed.

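$payload    = str_repeat("Squeamish ossifrage. \n", 253);

$time       = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    $pid = pcntl_fork();
    switch ($pid) {
        case 0:
            // Child: write one file in parallel, then exit.
            $fp = fopen("file-{$i}.txt", "w");
            fwrite($fp, $payload);
            fclose($fp);
            exit(0);
        case -1:
            echo 'Could not fork process.';
            exit(1);
        default:
            // Parent: reap children that have already finished, so that
            // zombies do not pile up against the process limit.
            while (pcntl_waitpid(-1, $status, WNOHANG) > 0);
            break;
    }
}
// Wait for the remaining children before stopping the clock.
while (pcntl_waitpid(-1, $status) > 0);
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";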

(The fancy name r strategy is taken from biology).

In this example, spawning times are catastrophic if compared to what each child needs to do. Therefore, overall processing time skyrockets. With more complex children things would go better, but you must be careful not to turn the script into a fork bomb.

One possibility, where feasible, is to divide the files to be created into, say, chunks of 10% each. Each child would then change its working directory with chdir() and create its files in a different directory. This would negate the penalty for writing files in different subdirectories (each child writes in its current directory), while benefiting from writing fewer files per directory. In this case, with very lightweight and I/O bound operations in the child, again the strategy isn't worthwhile (I get doubled execution time).
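
The chunking itself can be sketched as below. partition() and write_chunk() are hypothetical helper names, and the demo runs the chunks one after another instead of forking, so that it works without pcntl; in the forked variant each write_chunk() call would run inside its own child.

```php
<?php
// Hypothetical helper: split indices 0..$total-1 into $parts contiguous
// [start, end) ranges of nearly equal size.
function partition(int $total, int $parts): array
{
    $ranges = [];
    $size   = intdiv($total, $parts);
    $extra  = $total % $parts; // the first $extra ranges get one more
    $start  = 0;
    for ($p = 0; $p < $parts; $p++) {
        $end      = $start + $size + ($p < $extra ? 1 : 0);
        $ranges[] = [$start, $end];
        $start    = $end;
    }
    return $ranges;
}

// What one child would do: enter its own directory, write its share.
function write_chunk(string $dir, int $start, int $end, string $payload): void
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    chdir($dir);
    for ($i = $start; $i < $end; $i++) {
        file_put_contents("file-{$i}.txt", $payload);
    }
}

$base    = sys_get_temp_dir() . '/chunks-' . uniqid();
$payload = str_repeat("Squeamish ossifrage. \n", 253);
$ranges  = partition(1000, 10);
foreach ($ranges as $n => [$start, $end]) {
    // Sequential stand-in for the body of a forked child.
    write_chunk("{$base}/part-{$n}", $start, $end, $payload);
}
print count($ranges) . " chunks written\n";
```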

Parallel processing: K strategy

TL;DR this is more complex but works well... on my system. Your mileage may vary. While r strategy involves lots of fire-and-forget children, K strategy calls for a limited number of children (possibly one), nurtured carefully. Here we offload the creation of all the files to one parallel child process, and communicate with it via sockets.

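$payload    = str_repeat("Squeamish ossifrage. \n", 253);

// Read exactly $len bytes; a stream socket may return fewer per call.
function read_exact($sock, $len) {
    $buf = '';
    while (strlen($buf) < $len) {
        $chunk = socket_read($sock, $len - strlen($buf), PHP_BINARY_READ);
        if (false === $chunk || '' === $chunk) {
            return false;
        }
        $buf .= $chunk;
    }
    return $buf;
}

// Write the whole buffer; socket_write() may accept only part of it.
function write_all($sock, $data) {
    $total = strlen($data);
    for ($off = 0; $off < $total; $off += $sent) {
        $sent = socket_write($sock, substr($data, $off));
        if (false === $sent) {
            return false;
        }
    }
    return true;
}

$sockets = array();
$domain = (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN' ? AF_INET : AF_UNIX);
if (socket_create_pair($domain, SOCK_STREAM, 0, $sockets) === false) {
    die("socket_create_pair failed. Reason: ".socket_strerror(socket_last_error())."\n");
}
$pid = pcntl_fork();
if ($pid == -1) {
    die("Could not fork process.\n");
} elseif ($pid) {
    /* parent: keeps $sockets[1] */
    socket_close($sockets[0]);
} else {
    /* child: keeps $sockets[0] and writes the files */
    socket_close($sockets[1]);
    for (;;) {
        // Check for failure before trimming: trim(false) would hide it.
        $cmd = read_exact($sockets[0], 5);
        if (false === $cmd) {
            die("ERROR\n");
        }
        $cmd = trim($cmd);
        if ('QUIT' === $cmd) {
            write_all($sockets[0], "OK");
            socket_close($sockets[0]);
            exit(0);
        }
        if ('FILE' === $cmd) {
            // Header: 20-byte padded file name, 8-digit payload length.
            $file   = trim(read_exact($sockets[0], 20));
            $len    = (int) read_exact($sockets[0], 8);
            $data   = read_exact($sockets[0], $len);
            $fp     = fopen($file, "w");
            fwrite($fp, $data);
            fclose($fp);
            continue;
        }
        die("UNKNOWN COMMAND: {$cmd}");
    }
}

$time       = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    // Instead of fopen/fwrite/fclose here, send the job to the child.
    write_all($sockets[1], sprintf("FILE %20.20s%08.08s", "file-{$i}.txt", strlen($payload)));
    write_all($sockets[1], $payload);
}
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";

// Ask the child to finish, wait for its acknowledgement, then reap it.
write_all($sockets[1], "QUIT\n");
$ok = read_exact($sockets[1], 2);
socket_close($sockets[1]);
pcntl_waitpid($pid, $status);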

THIS IS HUGELY DEPENDENT ON THE SYSTEM CONFIGURATION. For example, on a single-processor, single-core CPU without hardware threads, this is madness: you'll at least double the total runtime, and more likely it will be three to ten times slower.

So this is definitely not the way to pimp up something running on an old system.

On a modern multithreading CPU and supposing the main content creation loop is CPU bound, you may experience the reverse - the script might go ten times faster.

On my system, the "forking" solution above runs a bit less than three times faster. I expected more, but there you are.

Of course, whether the performance gain is worth the added complexity and maintenance remains to be evaluated.

The bad news

While experimenting above, I came to the conclusion that file creation on a reasonably configured and performant Linux machine is fast as hell. So not only is it difficult to squeeze out more performance, but if you're experiencing slowness, it's very likely not file related. Try giving some more detail about how you create that content.
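
One way to check where the time actually goes is to time content generation and file writing separately. The following is a minimal sketch: generate_content() is a hypothetical stand-in for whatever really produces the 4 KB of content, and the 1000-file count is arbitrary.

```php
<?php
// Hypothetical stand-in for the real content generator.
function generate_content(int $i): string
{
    return str_repeat(md5((string) $i), 128); // 32 * 128 = 4096 bytes
}

$dir = sys_get_temp_dir() . '/timing-' . uniqid();
mkdir($dir);

$genTime   = 0.0;
$writeTime = 0.0;
for ($i = 0; $i < 1000; $i++) {
    $t0   = microtime(true);
    $data = generate_content($i);
    $t1   = microtime(true);
    file_put_contents("{$dir}/file-{$i}.txt", $data);
    $t2   = microtime(true);

    $genTime   += $t1 - $t0; // time spent building the content
    $writeTime += $t2 - $t1; // time spent writing it to disk
}
printf("generation: %.4f s, writing: %.4f s\n", $genTime, $writeTime);
```

If the first number dominates, no amount of file system tuning will help much.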

Upvotes: 14

CatalinB

Reputation: 581

The main idea is to have fewer files. For example, 1,000 files can be packed into 100 files, each containing 10 of the originals, and split again with explode(). In my tests this was about 5x faster on write and 14x faster on read+parse. Optimizing file_put_contents or fwrite alone will not get you much more than a 1.x speedup. This solution is useful for both reading and writing. Another solution may be MySQL or another database.

On my computer, creating 30k files with a small string takes 96.38 seconds, while appending the same string 30k times to one file takes 0.075 seconds.

I can offer you an unusual solution: call file_put_contents fewer times. Below is some simple code to show how it works.

$start = microtime(true);

$str = "Aaaaaaaaaaaaaaaaaaaaaaaaa";

if( !file_exists("test/") ) mkdir("test/");

// Variant 1: one file_put_contents() call per file
foreach( range(1,1000) as $i ) {
    file_put_contents("test/".$i.".txt",$str);
}

$end = microtime(true);
echo "elapsed_file_put_contents_1: ".substr(($end - $start),0,5)." sec\n";

$start = microtime(true);

// Variant 2: build one string, then write it with a single call
$out = '';
foreach( range(1,1000) as $i ) {
    $out .= $str;
}
file_put_contents("out.txt",$out);

$end = microtime(true);
echo "elapsed_file_put_contents_2: ".substr(($end - $start),0,5)." sec\n";

This is a full example with 1000 files, with elapsed times:

writing with file_put_contents: elapsed 194.4 sec
writing with file_put_contents and FILE_APPEND: elapsed 37.83 sec (5x faster)
reading the file_put_contents files: elapsed 2.401 sec
reading the appended files: elapsed 0.170 sec (14x faster)

    $start = microtime(true);

    $allow_argvs = array("gen_all","gen_few","read_all","read_few");

    $arg = isset($argv[1]) ? $argv[1] : die("php ".$argv[0]." gen_all ( ".implode(", ",$allow_argvs).")");

    if( !in_array($arg,$allow_argvs) ) {
        die("php ".$argv[0]." gen_all ( ".implode(", ",$allow_argvs).")");
    }


    if( $arg=='gen_all' ) {

        $dir_campain_all_files = "campain_all_files/";
        if( !file_exists($dir_campain_all_files) ) die("\nFolder ".$dir_campain_all_files." not exist!\n");

        $exists_campaings = false;
        foreach( range(1,10) as $i ) { if( file_exists($dir_campain_all_files.$i) ) { $exists_campaings = true; } }
        if( $exists_campaings ) {
            die("\nDelete manualy all subfolders from ".$dir_campain_all_files." !\n");
        }   
        build_campain_dirs($dir_campain_all_files);

        // foreach in campaigns
        foreach( range(1,10) as $i ) {
            $campain_dir = $dir_campain_all_files.$i."/";
            $nr_of_files = 1000;  
            foreach( range(1,$nr_of_files) as $f ) {
                $file_name = $f.".txt";
                $data_file = generateRandomString(4*1024);
                $dir_file_name = $campain_dir.$file_name;
                file_put_contents($dir_file_name,$data_file);
            }
            echo "campaign #".$i." done! ( ".$nr_of_files." files written ).\n";
        }   
    }


    if( $arg=='gen_few' ) { 
        $delim_file = "###FILE###";
        $delim_contents = "@@@FILE@@@";

        $dir_campain = "campain_few_files/";
        if( !file_exists($dir_campain) ) die("\nFolder ".$dir_campain." does not exist!\n");

        $exists_campaings = false;
        foreach( range(1,10) as $i ) { if( file_exists($dir_campain.$i) ) { $exists_campaings = true; } }
        if( $exists_campaings ) {
            die("\nDelete manualy all files from ".$dir_campain." !\n");
        }           

        $amount = 100; // nr_of_files_to_append

        $out = ''; // here will be appended

        build_campain_dirs($dir_campain);

        // foreach in campaigns
        foreach( range(1,10) as $i ) {
            $campain_dir = $dir_campain.$i."/";

            $nr_of_files = 1000; 
            $cnt_few=1;
            foreach( range(1,$nr_of_files) as $f ) {

                $file_name = $f.".txt";
                $data_file = generateRandomString(4*1024);

                $my_file_and_data = $file_name.$delim_file.$data_file;
                $out .= $my_file_and_data.$delim_contents;

                // append in a new file
                if( $f%$amount==0 ) {
                    $dir_file_name = $campain_dir.$cnt_few.".txt";
                    file_put_contents($dir_file_name,$out,FILE_APPEND);
                    $out = '';
                    $cnt_few++;
                }

            }
            // append remaning files 
            if( !empty($out) ) {
                $dir_file_name = $campain_dir.$cnt_few.".txt";
                file_put_contents($dir_file_name,$out,FILE_APPEND);
                $out = '';

            }
            echo "campaign #".$i." done! ( ".$nr_of_files." files written ).\n";
        }
    }


    if( $arg=='read_all' ) {    
        $dir_campain = "campain_all_files/";

        $exists_campaings = false;
        foreach( range(1,10) as $i ) {
            if( file_exists($dir_campain.$i) ) {
                $exists_campaings = true;
            }
        }

        foreach( range(1,10) as $i ) {
            $campain_dir = $dir_campain.$i."/";
            $files = getFiles($campain_dir); 
            foreach( $files as $file ) {
                $data = file_get_contents($file);
                $substr = substr($data, 100, 5); // read 5 chars after char100       
            }
            echo "campaign #".$i." done! ( ".count($files)." files read ).\n";

        }   
    }



    if( $arg=='read_few' ) {
        $dir_campain = "campain_few_files/";

        $exists_campaings = false;
        foreach( range(1,10) as $i ) {
            if( file_exists($dir_campain.$i) ) {
                $exists_campaings = true;
            }
        }

        foreach( range(1,10) as $i ) {
            $campain_dir = $dir_campain.$i."/";
            $files = getFiles($campain_dir); 
            foreach( $files as $file ) {
                $data_temp = file_get_contents($file);
                $explode = explode("@@@FILE@@@",$data_temp);
                //@mkdir("test/".$i);
                foreach( $explode as $exp ) {
                    $temp_exp = explode("###FILE###",$exp);
                    if( count($temp_exp)==2 ) {
                        $file_name = $temp_exp[0];
                        $file_data = $temp_exp[1];
                        $substr = substr($file_data, 100, 5); // read 5 chars after char100     
                        //file_put_contents("test/".$i."/".$file_name,$file_data); // test if files are recreated correctly
                    }
                }
                //echo $file." has ".strlen($data_temp)." chars!\n";
            }
            echo "campaign #".$i." done! ( ".count($files)." files read ).\n";

        }   
    }

    $end = microtime(true); 
    echo "elapsed: ".substr(($end - $start),0,5)." sec\n";


    echo "\n\nALL DONE!\n\n";






    /*************** FUNCTIONS ******************/


    function generateRandomString($length = 10) {
        $characters = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
        $charactersLength = strlen($characters);
        $randomString = '';
        for ($i = 0; $i < $length; $i++) {
            $randomString .= $characters[rand(0, $charactersLength - 1)];
        }
        return $randomString;
    }

    function build_campain_dirs($dir_campain) {
        foreach( range(1,10) as $i ) {
            $dir = $dir_campain.$i;
            if( !file_exists($dir) ) {
                mkdir($dir);
            }
        }
    }

    function getFiles($dir) {
        $arr = array();
        if ($handle = opendir($dir)) {
            while (false !== ($file = readdir($handle))) {
                if ($file != "." && $file != "..") {
                    $arr[] = $dir.$file;
                }
            }
            closedir($handle);
        }
        return $arr;
    }   
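
One caveat with the delimiter approach: explode() splits in the wrong place if the file content itself happens to contain "###FILE###" or "@@@FILE@@@". A length-prefixed container avoids that. The sketch below is a variation, not part of the benchmark above; pack_records() and unpack_records() are hypothetical names, and the 2-byte/4-byte header layout is an arbitrary choice.

```php
<?php
// Hypothetical container format: for each record, a 2-byte name length
// and a 4-byte data length (both big-endian), then the name and data.
function pack_records(array $records): string
{
    $out = '';
    foreach ($records as $name => $data) {
        $out .= pack('nN', strlen($name), strlen($data)) . $name . $data;
    }
    return $out;
}

function unpack_records(string $blob): array
{
    $records = [];
    $pos     = 0;
    $total   = strlen($blob);
    while ($pos < $total) {
        // Fixed 6-byte header tells us how much to read next.
        $head = unpack('nname/Ndata', substr($blob, $pos, 6));
        $pos += 6;
        $name = substr($blob, $pos, $head['name']);
        $pos += $head['name'];
        $records[$name] = substr($blob, $pos, $head['data']);
        $pos += $head['data'];
    }
    return $records;
}

// Demo: two small "files" packed into one real file and read back.
$records = [
    'a.txt' => "contains ###FILE### and @@@FILE@@@ safely",
    'b.txt' => str_repeat('x', 4096),
];
$packed = sys_get_temp_dir() . '/packed-' . uniqid() . '.bin';
file_put_contents($packed, pack_records($records));
$restored = unpack_records(file_get_contents($packed));
print count($restored) . " records restored\n";
```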

Upvotes: 1

SomeDude

Reputation: 320

Having read your description, I understand you're writing many files that are each rather small. The way PHP usually works (at least in the Apache server), there is overhead for each filesystem access: a file pointer and a buffer are opened and maintained for each file. Since there are no code samples to review here, it's hard to see where the inefficiencies are.

However, using file_put_contents() for 300,000+ files appears to be slightly less efficient than using fopen() and fwrite() or fflush() directly, then fclose() when you're done. I'm saying that based on a benchmark done by a fellow in the comments of the PHP documentation for file_put_contents() at http://php.net/manual/en/function.file-put-contents.php#105421

Next, when dealing with such small file sizes, it sounds like there's a great opportunity to use a database instead of flat files (I'm sure you've heard that before). A database, whether MySQL or PostgreSQL, is highly optimized for simultaneous access to many records, and can internally balance CPU workload in ways that filesystem access never can (and binary data in records is possible too). Unless you need access to real files directly from your server hard drives, a database can simulate many files by allowing PHP to return individual records as file data over the web (i.e., by using the header() function). Again, I'm assuming this PHP is running as a web interface on a server.

Overall, what I am reading suggests that there may be an inefficiency somewhere else besides filesystem access. How is the file content generated? How does the operating system handle file access? Is there compression or encryption involved? Are these images or text data? Is the OS writing to one hard drive, a software RAID array, or some other layout? Those are some of the questions I can think of just glancing over your problem. Hopefully my answer helped. Cheers.

Upvotes: 7
