LookIntoEast

Reputation: 8808

Is .append() time-consuming?

I've been manipulating huge text files these days. Sometimes I need to delete lines. My way of doing it looks like this:

f=open('txt','r').readlines()
list=[]
for line in f:
    if blablablabla:
       list.append(line)

I know that for large files, .readlines() is the rate-limiting step, but what about the .append() step? Does append cost a lot of extra time after readlines? If so, maybe I should find a way to directly delete the lines I don't want, instead of appending the lines I want.

Thanks

Upvotes: 1

Views: 2061

Answers (5)

hughdbrown

Reputation: 49033

Maybe you want to pull it all into memory and then operate on it. Maybe it makes more sense to operate on one line at a time. It's not clear from your explanation which is better.

In any event, here is pretty standard code for whichever approach you take:

# Pull one line into memory at a time
with open('txt', 'r') as f:
    lineiter = (line for line in f if blablablabla)
    for line in lineiter:
        pass  # do stuff with each line

# Read the whole file into memory then work on it
with open('txt','r') as f:
    lineiter = (line for line in f if blablablabla)
    mylines = list(lineiter)

If you go the former route, I recommend that you read up on generators. Dave Beazley has an awesome article on generators called "Generator Tricks for Systems Programmers". Highly recommended.
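That style can be sketched as a small pipeline of chained generators (the stage names, file name, and "ERROR" filter below are made up for illustration, not taken from the article):

```python
def read_lines(path):
    # Yield lines lazily instead of loading the whole file into memory
    with open(path) as f:
        for line in f:
            yield line

def grep(pattern, lines):
    # A chainable filtering stage: also lazy, nothing happens until consumed
    return (line for line in lines if pattern in line)

# Demo input (made up for illustration)
with open("app.log", "w") as f:
    f.write("ok\nERROR one\nfine\nERROR two\n")

errors = list(grep("ERROR", read_lines("app.log")))
print(errors)  # ['ERROR one\n', 'ERROR two\n']
```

Each stage holds only one line at a time, so the same pipeline works on files far larger than memory.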

Upvotes: 0

Sunjay Varma

Reputation: 5115

In this post, I'll try to explain how lists work and why append is not very expensive. There is also a solution at the bottom which you can use to delete lines.

Python's lists are not linked lists; they are dynamic arrays: a contiguous block of references with some spare capacity kept at the end. An append usually just fills the next spare slot, so it has very little overhead. insert(0, i), on the other hand, has to shift every existing element one position to the right first, so it gets slower as the list grows:

>>> from timeit import timeit
>>> timeit('a=[]\nfor i in range(100): a.append(i)', number=1000)
0.03651859015577941
>>> timeit('a=[]\nfor i in range(100): a.insert(0, i)', number=1000)
0.047090002177625934
>>> timeit('a=[]\nfor i in range(100): a.append(i)', number=10000)
0.18015429656996673
>>> timeit('a=[]\nfor i in range(100): a.insert(0, i)', number=10000)
0.35550057300308424

As you can see, insert is much slower. If I were you, I would just eliminate the lines you don't need by writing the ones you keep back right away:

# Write kept lines to a temporary file: opening "large.txt" itself
# with mode "w" would truncate it before we could read anything.
import os

with open("large.txt", "r") as fin, open("large.txt.tmp", "w") as fout:
    for line in fin:
        if myfancyconditionismet:
            fout.write(line)  # the line already ends with "\n"
        # otherwise it is gone

os.replace("large.txt.tmp", "large.txt")

That's my explanation and solution.

-Sunjay03

Upvotes: 1

John La Rooy

Reputation: 304355

You should use a list comprehension instead as in Jeff's answer. Depending on how you need to process the data, you may be able to use a generator expression instead.
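For example, if each matching line only needs to be looked at once, a generator expression streams the file instead of holding a filtered list in memory (the file contents and the "keep" condition below are made up for illustration):

```python
# Tiny sample file so the sketch is self-contained (contents made up)
with open("txt", "w") as f:
    f.write("keep me\ndrop me\nkeep this too\n")

with open("txt") as f:
    wanted = (line for line in f if "keep" in line)  # nothing read yet
    count = sum(1 for _ in wanted)                   # consumes one line at a time

print(count)  # 2
```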

To answer your question about append():

Python lists are preallocated with some extra space at the end, so append is very fast - until you run out of that preallocated space. When that happens, a new, larger block of memory is allocated and all the references are copied over to it. As the list grows, so does the size of the extra preallocated space. This is done so that append is amortized O(1), i.e. the average time per append is fast and constant.
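A rough way to watch this over-allocation from Python itself is sys.getsizeof, which reports the list object's allocated size including the spare slots (the cut-off of 64 appends below is arbitrary, and the exact growth pattern is a CPython implementation detail):

```python
import sys

lst = []
jumps = []  # (length, size_in_bytes) recorded at each reallocation
prev = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev:
        # the size only changes when a larger block is allocated
        jumps.append((len(lst), size))
        prev = size

print(jumps)  # only a handful of lengths trigger a reallocation
```

Most appends leave the reported size unchanged; only a few cause a copy into a bigger block, which is exactly why the average cost stays constant.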

Upvotes: 1

Fredrik Pihl

Reputation: 45672

As a general hint, do this instead; there is no need to read the complete file before iterating through it:

my_list = []
with open('txt') as fd:
    for line in fd:
        if blablabla:
            my_list.append(line)

and don't call a list "list"...

Upvotes: 2

Jeff Mercado

Reputation: 134981

Why read the entire file in with readlines() if you're going to filter it afterwards? Just iterate through it, saving the lines you want to keep. You can reduce this to a couple of lines using a list comprehension:

with open('txt', 'r') as f:
    myList = [line for line in f if blablablabla]

Upvotes: 5
