Raf

Reputation: 1757

C vs Python I/O performance for lots of small files

In terms of I/O, I'd expect Python and C to have similar performance, but I'm seeing C run 1.5 to 2 times faster than Python for a similar implementation.

The task is simple: concatenate thousands of ~250-byte text files, each containing two lines:

Header1 \t Header2 \t ... HeaderN
float1  \t float2  \t ... floatN

The header is the same for all files, so it is written only once, and the output file looks like:

Header1 \t Header2 \t ... HeaderN
float1  \t float2  \t ... floatN
float1  \t float2  \t ... floatN
float1  \t float2  \t ... floatN
... thousands of lines
float1  \t float2  \t ... floatN

Here is my implementation in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // strcat, strstr
#include <dirent.h>
#include <unistd.h>  // chdir
#include <time.h>

#define LINE_SIZE 300
#define BUFFER_SZ 5000*LINE_SIZE

void combine(char *fname) {
    DIR *d;
    FILE * fp;
    char line[LINE_SIZE];
    char buffer[BUFFER_SZ];
    short flagHeader = 1;
    buffer[0] = '\0';  // need to init buffer before strcat'ing to it

    struct dirent *dir;
    chdir("runs");
    d = opendir(".");
    if (d) {
        while ((dir = readdir(d)) != NULL) {
            if ((strstr(dir->d_name, "Hs")) && (strstr(dir->d_name, ".txt")) ) {
                fp = fopen (dir->d_name, "r");
                fgets(line, LINE_SIZE, fp);  // read first line
                if (flagHeader) {  // append it to buffer only once
                    strcat(buffer, line);
                    flagHeader = 0;
                }
                fgets(line, LINE_SIZE, fp);  // read second line
                strcat(buffer, line);
                fclose(fp);
            }
        }
        closedir(d);
        chdir("..");
        fp = fopen(fname, "w");
        fprintf(fp, "%s", buffer);  // don't pass buffer as the format string
        fclose(fp);
    }
}

int main() {

    clock_t tc;
    int msec;

    tc = clock(); 
    combine("results_c.txt");
    msec = (clock() - tc) * 1000 / CLOCKS_PER_SEC;
    printf("elapsed time: %d.%ds\n", msec/1000, msec%1000);
    return 0;
}

And in Python:

import glob
from time import time


def combine(wildcard, fname='results.txt'):
    """Concatenates all files matching a name pattern into one file.
    Assumes that the files have 2 lines, the first one being the header.
    """
    files = glob.glob(wildcard)
    buffer = ''
    flagHeader = True
    for file in files:
        with open(file, 'r') as pf:
            lines = pf.readlines()
        if len(lines) != 2:
            print('Error reading file %s. Skipping.' % file)
            continue
        if flagHeader:
            buffer += lines[0]
            flagHeader = False
        buffer += lines[1]

    with open(fname, 'w') as pf:
        pf.write(buffer)


if __name__ == '__main__':
    et = time()
    combine('runs\\Hs*.txt')
    et = time() - et
    print("elapsed time: %.3fs" % et)

And here is a benchmark of 10 runs each - the files are on a local network drive in a busy office, which I guess explains the variation:

Run 1/10
C      elapsed time: 9.530s
Python elapsed time: 10.225s
===================
Run 2/10
C      elapsed time: 5.378s
Python elapsed time: 10.613s
===================
Run 3/10
C      elapsed time: 6.534s
Python elapsed time: 13.971s
===================
Run 4/10
C      elapsed time: 5.927s
Python elapsed time: 14.181s
===================
Run 5/10
C      elapsed time: 5.981s
Python elapsed time: 9.662s
===================
Run 6/10
C      elapsed time: 4.658s
Python elapsed time: 9.757s
===================
Run 7/10
C      elapsed time: 10.323s
Python elapsed time: 19.032s
===================
Run 8/10
C      elapsed time: 8.236s
Python elapsed time: 18.800s
===================
Run 9/10
C      elapsed time: 7.580s
Python elapsed time: 15.730s
===================
Run 10/10
C      elapsed time: 9.465s
Python elapsed time: 20.532s
===================

Also, a profile run of the Python implementation shows that about 70% of the time is spent in io.open, and the rest in readlines.

In [2]: prun bc.combine('runs\\Hs*.txt')
         64850 function calls (64847 primitive calls) in 12.205 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1899    8.391    0.004    8.417    0.004 {built-in method io.open}
     1898    3.322    0.002    3.341    0.002 {method 'readlines' of '_io._IOBase' objects}
        1    0.255    0.255    0.255    0.255 {built-in method nt.listdir}
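
(For reference, the same profile can be collected outside IPython with cProfile. Here bc is just a placeholder for whatever module defines combine, matching the alias used in the prun call above.)

import cProfile
import bc  # placeholder: the module that defines combine(), as aliased in the prun call

# Sort by internal time, mirroring the "Ordered by: internal time" output above
cProfile.run(r"bc.combine('runs\\Hs*.txt')", sort='tottime')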

Even if readlines were much slower than fgets, the time Python spends in io.open alone is larger than the total C runtime. And in the end, both readlines and fgets read the file line by line, so I'd expect more comparable performance.

So, to my question: in this particular case, why is Python so much slower than C for I/O?

Upvotes: 4

Views: 3779

Answers (1)

Acorn

Reputation: 26066

It boils down to a few things:

  1. Most importantly, the Python version uses text mode (i.e. r and w), which implies decoding the data into str (Unicode) objects instead of handling raw bytes (see the small check after this list).

  2. There are many small files and very little work per file, so Python's own overhead (e.g. setting up the file objects in open) becomes significant.

  3. Python has to dynamically allocate memory for most things.
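
To make the first point concrete, here is a minimal check (the file name is just a placeholder for one of the input files) showing that text mode hands back decoded str objects while binary mode hands back raw bytes:

with open('runs/Hs_example.txt', 'r') as f:   # placeholder file name; text mode decodes
    print(type(f.readline()))                 # <class 'str'>

with open('runs/Hs_example.txt', 'rb') as f:  # binary mode: raw bytes, no decoding
    print(type(f.readline()))                 # <class 'bytes'>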

Also note that I/O in this test is not that relevant if you use local files and do multiple runs, since they will already be cached in memory. The only real I/O will be the final write (and even then, you would have to make sure you are flushing/syncing to disk).
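
If you do want the final write to actually hit the disk before stopping the timer, a minimal sketch would be (data below is only a placeholder for the concatenated bytes):

import os

data = b'Header1\tHeader2\nfloat1\tfloat2\n'  # placeholder for the concatenated bytes

with open('results-python-new.txt', 'wb') as fout:
    fout.write(data)
    fout.flush()               # push Python's userspace buffer to the OS
    os.fsync(fout.fileno())    # ask the OS to write its cache out to disk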

Now, if you take care of the text mode (i.e. use rb and wb) and also reduce the allocations (less important in this case, but still noticeable), you get something like this:

import glob


def combine():
    flagHeader = True
    with open('results-python-new.txt', 'wb') as fout:
        for filename in glob.glob('runs/Hs*.txt'):
            with open(filename, 'rb') as fin:
                header = fin.readline()
                values = fin.readline()
                if flagHeader:
                    flagHeader = False
                    fout.write(header)
                fout.write(values)

Then Python already finishes the task in half the time -- actually faster than the C version:

Old C:      0.234
Old Python: 0.389
New Python: 0.213

Possibly you can still improve the time a bit, e.g. by avoiding the glob.
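
For instance, here is a sketch of a glob-free variant using os.scandir, with the Hs*.txt pattern expressed as startswith/endswith checks (untimed, so treat it as a suggestion rather than a measured result):

import os

def combine_scandir():
    flagHeader = True
    with open('results-python-new.txt', 'wb') as fout:
        with os.scandir('runs') as it:            # iterate directory entries lazily
            for entry in it:
                name = entry.name
                if name.startswith('Hs') and name.endswith('.txt'):
                    with open(entry.path, 'rb') as fin:
                        header = fin.readline()
                        values = fin.readline()
                        if flagHeader:
                            flagHeader = False
                            fout.write(header)
                        fout.write(values)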

However, if you also apply a couple of similar modifications to the C version, you get a much better time -- about a third of the Python time:

New C:      0.068

Take a look:

#include <stdio.h>
#include <string.h>  // strstr
#include <dirent.h>
#include <unistd.h>  // chdir

#define LINE_SIZE 300

void combine(void) {
    DIR *d;
    FILE *fin;
    FILE *fout;
    struct dirent *dir;
    char headers[LINE_SIZE];
    char values[LINE_SIZE];
    short flagHeader = 1;

    fout = fopen("results-c-new.txt", "wb");
    chdir("runs");
    d = opendir(".");
    if (d) {
        while ((dir = readdir(d)) != NULL) {
            if ((strstr(dir->d_name, "Hs")) && (strstr(dir->d_name, ".txt")) ) {
                fin = fopen(dir->d_name, "rb");
                fgets(headers, LINE_SIZE, fin);
                fgets(values, LINE_SIZE, fin);
                if (flagHeader) {
                    flagHeader = 0;
                    fputs(headers, fout);
                }
                fputs(values, fout);
                fclose(fin);
            }
        }
        closedir(d);
        fclose(fout);
    }
}

Upvotes: 6
