rassa45

Reputation: 3550

Why does copying a file line by line greatly affect copy speed in Python?

A while ago, I made a Python script which looked similar to this:

with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        w.write(line)

Which, of course, worked pretty slowly on a 100 MB file.

However, I changed the program to do this:

ls = []
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        ls.append(line)
        if len(ls) == 100000:
            w.writelines(ls)
            del ls[:]

And the file copied much faster. My question is: why does the second method work faster, even though the program copies the same number of lines (albeit collecting them first and then writing them out in batches)?

Upvotes: 10

Views: 467

Answers (3)

Kasravnd
Kasravnd

Reputation: 107287

That's because in the first version you have to call the write method for every line in each iteration, which makes your program take much more time to run. The second version wastes more memory, but it performs better because you call the writelines() method only once per 100,000 lines.
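Here is a minimal sketch of that difference in call counts, using the chunk size of 100,000 from the question (the input file name comes from the question; the output file names are made up for illustration):

# Count the write-level calls each approach makes.
calls = 0
with open("somefile.txt", "r") as f, open("copy_a.txt", "w") as w:
    for line in f:
        w.write(line)  # one write() call per line
        calls += 1
print("per-line write() calls:", calls)

calls = 0
batch = []
with open("somefile.txt", "r") as f, open("copy_b.txt", "w") as w:
    for line in f:
        batch.append(line)
        if len(batch) == 100000:
            w.writelines(batch)  # one writelines() call per 100,000 lines
            calls += 1
            del batch[:]
    if batch:  # flush the final partial batch (the question's code skips this)
        w.writelines(batch)
        calls += 1
print("batched writelines() calls:", calls)

For a million-line file the first loop makes a million write() calls, while the second makes only about ten writelines() calls.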

Let's look at the source of the writelines function:

def writelines(self, list_of_data):
    """Write a list (or any iterable) of data bytes to the transport.

    The default implementation concatenates the arguments and
    calls write() on the result.
    """
    if not _PY34:
        # In Python 3.3, bytes.join() doesn't handle memoryview.
        list_of_data = (
            bytes(data) if isinstance(data, memoryview) else data
            for data in list_of_data)
    self.write(b''.join(list_of_data))

As you can see, it joins all the list items and calls the write function only once.

Note that joining the data here takes time, but it's less than the time spent calling the write function for each line. But since you use Python 3.4, it writes the lines one at a time rather than joining them, so it would be much faster than write in this case:

  • cStringIO.writelines() now accepts any iterable argument and writes the lines one at a time rather than joining them and writing once. Made a parallel change to StringIO.writelines(). Saves memory and makes suitable for use with generator expressions.
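The two behaviors that note contrasts can be sketched in plain Python (illustrative helpers only, not the actual library code):

def writelines_by_joining(write, lines):
    # Older strategy: concatenate everything, then make one write() call.
    # Fewer calls, but builds one big string in memory first.
    write(''.join(lines))

def writelines_one_at_a_time(write, lines):
    # Strategy described in the note: write each line as it arrives.
    # Saves memory and works with generator expressions.
    for line in lines:
        write(line)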

Upvotes: -1

SD.

Reputation: 9549

I do not agree with the other answer here.

It is simply a coincidence. It depends heavily on your environment:

  • What OS?
  • What HDD/CPU?
  • What HDD file system format?
  • How busy is your CPU/HDD?
  • What Python version?

Both pieces of code do the exact same thing, with only tiny differences in performance.

For me personally, .writelines() takes longer to execute than your first example using .write(). Tested with a 110 MB text file.

I will not post my machine specs on purpose.

Test .write(): ------copying took 0.934000015259 seconds (dashes for readability)

Test .writelines(): copying took 0.936999797821 seconds

Also tested with small files and with files as large as 1.5 GB, with the same results (writelines always being slightly slower, up to a 0.5 s difference for the 1.5 GB file).
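If you want to reproduce this comparison on your own machine, a minimal timing harness might look like this (the source file name and chunk size are assumptions; run it several times, since OS disk caching skews the first run):

import time

SRC = "somefile.txt"  # substitute your own ~100 MB test file

def copy_write(src, dst):
    with open(src, "r") as f, open(dst, "w") as w:
        for line in f:
            w.write(line)

def copy_writelines(src, dst, chunk=100000):
    batch = []
    with open(src, "r") as f, open(dst, "w") as w:
        for line in f:
            batch.append(line)
            if len(batch) == chunk:
                w.writelines(batch)
                del batch[:]
        if batch:
            w.writelines(batch)

for name, fn in (("write", copy_write), ("writelines", copy_writelines)):
    start = time.time()
    fn(SRC, "copy_%s.txt" % name)
    print("%s: copying took %s seconds" % (name, time.time() - start))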

Upvotes: 0

Brobin

Reputation: 3326

I may have found a reason why write is slower than writelines. Looking through the CPython source (3.4.3), I found the code for the write function (irrelevant parts removed).

Modules/_io/fileio.c

static PyObject *
fileio_write(fileio *self, PyObject *args)
{
    Py_buffer pbuf;
    Py_ssize_t n, len;
    int err;
    ...
    n = write(self->fd, pbuf.buf, len);
    ...

    PyBuffer_Release(&pbuf);

    if (n < 0) {
        if (err == EAGAIN)
            Py_RETURN_NONE;
        errno = err;
        PyErr_SetFromErrno(PyExc_IOError);
        return NULL;
    }

    return PyLong_FromSsize_t(n);
}

If you notice, this function actually returns a value: the size of the string that has been written, and producing that value takes another function call.

I tested this out to see if it actually had a return value, and it did.

with open('test.txt', 'w+') as f:
    x = f.write("hello")
    print(x)

>>> 5

The following is the code for the writelines function implementation in CPython (irrelevant parts removed).

Modules/_io/iobase.c

static PyObject *
iobase_writelines(PyObject *self, PyObject *args)
{
    PyObject *lines, *iter, *res;

    ...

    while (1) {
        PyObject *line = PyIter_Next(iter);
        ...
        res = NULL;
        do {
            res = PyObject_CallMethodObjArgs(self, _PyIO_str_write, line, NULL);
        } while (res == NULL && _PyIO_trap_eintr());
        Py_DECREF(line);
        if (res == NULL) {
            Py_DECREF(iter);
            return NULL;
        }
        Py_DECREF(res);
    }
    Py_DECREF(iter);
    Py_RETURN_NONE;
}

If you notice, there is no return value! It simply has Py_RETURN_NONE instead of another function call to compute the size of the written data.

So, I went ahead and tested that there really wasn't a return value.

with open('test.txt', 'w+') as f:
    x = f.writelines(["hello", "hello"])
    print(x)

>>> None

The extra time that write takes seems to be due to the extra function call the implementation makes to produce the return value. By using writelines, you skip that step, and the file I/O is the only bottleneck.
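One rough way to see the per-call overhead without disk speed dominating is an in-memory micro-benchmark (a sketch, using io.StringIO rather than a real file, so it measures total per-call cost, not just the return-value construction):

import io
import timeit

lines = ["x" * 80 + "\n"] * 100000  # synthetic data, roughly 8 MB

def many_writes():
    buf = io.StringIO()
    for line in lines:
        buf.write(line)  # 100,000 calls, each producing an int return value

def one_writelines():
    buf = io.StringIO()
    buf.writelines(lines)  # a single call, which returns None

print("write x100000:", timeit.timeit(many_writes, number=10))
print("writelines   :", timeit.timeit(one_writelines, number=10))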

Edit: write documentation

Upvotes: 2
