Reputation: 17

Python script performance issue: Read text file and remove duplicates

I have a .txt file with values separated by semicolons. What I want to do is extract the first column and add its values to a list, skipping duplicates. What I came up with is:

values = []
with open(filename, 'r') as file:
    data = file.readlines()
    for line in data:
        tmpVal = line.split(';')[0]
        if tmpVal not in values:
            values.append(tmpVal)

The file is fairly big (~706 MB), and this script runs very slowly (it has been running for about 10 minutes now).

Can someone point out where I can improve my code?

Thanks a million, Jerome

Upvotes: 0

Answers (3)

Gal Bashan

Reputation: 112

A possible improvement is to use a set instead of a list for values. This makes the if tmpVal not in values line unnecessary. On a list that check is an O(n) operation (expensive!), whereas a set simply ignores a value it already contains. Your code will be:

values = set()
with open(filename, 'r') as file:
    data = file.readlines()
    for line in data:
        tmpVal = line.split(';')[0]
        # add() is a no-op if the value is already in the set
        values.add(tmpVal)

And to make it more Pythonic:

with open(filename, 'r') as f:
    values = set(line.split(';')[0] for line in f.readlines())

Or, on Python 2.7 and later, using a set comprehension:

with open(filename, 'r') as f:
    values = {line.split(';')[0] for line in f.readlines()}
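
To see why the list version slows down as the number of values grows, here is a rough timing sketch. The sample data and the function names dedupe_with_list and dedupe_with_set are made up for illustration, and the actual timings will depend on your machine and data.

```python
import timeit


def dedupe_with_list(items):
    # 'in' on a list scans the elements one by one: O(n) per lookup
    values = []
    for item in items:
        if item not in values:
            values.append(item)
    return values


def dedupe_with_set(items):
    # add() on a set uses hashing: O(1) on average per call
    values = set()
    for item in items:
        values.add(item)
    return values


# hypothetical sample: 10,000 keys, 2,000 of them distinct
sample = [str(i % 2000) for i in range(10000)]

if __name__ == '__main__':
    print('list:', timeit.timeit(lambda: dedupe_with_list(sample), number=1))
    print('set: ', timeit.timeit(lambda: dedupe_with_set(sample), number=1))
```

Both functions find the same distinct values; only the cost per lookup differs, and that gap widens with every new distinct value the list has to scan past.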

Upvotes: 1

Michael Kazarian

Reputation: 4462

Use a set:

values = set()
with open(filename, 'r') as file:
    for line in file:
        tmpVal = line.split(';')[0]
        values.add(tmpVal)
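
One thing to keep in mind: a set does not remember the order in which values were first seen, while the list in the question does. If that order matters, a dict can stand in as an ordered set, since dicts preserve insertion order as of Python 3.7. A minimal sketch, where unique_first_column is a made-up helper name:

```python
def unique_first_column(filename, sep=';'):
    """Return the distinct first-column values in first-seen order."""
    seen = {}
    with open(filename, 'r') as f:
        for line in f:
            # dict keys are unique and keep insertion order (Python 3.7+);
            # setdefault() leaves an existing key untouched
            seen.setdefault(line.split(sep)[0], None)
    return list(seen)
```

Lookups stay O(1) on average, just as with a set, so this should not bring back the slowdown of the list version.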

Upvotes: 1

sureshvv

Reputation: 4422

Use a set instead of a list for values. Set membership checking will be a lot faster.
```
values = set()
```
Don't use readlines(), which loads the whole file into memory at once. Just iterate through the file itself.
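
Putting both suggestions together might look like the sketch below. It reads one line at a time rather than loading the whole ~706 MB up front, and it uses str.partition, which stops at the first semicolon instead of splitting the entire line. The function name first_column_values is only a placeholder.

```python
def first_column_values(filename, sep=';'):
    """Collect the distinct first-column values of a delimited text file."""
    values = set()
    with open(filename, 'r') as f:
        # iterating over the file object yields one line at a time,
        # so the whole file never has to sit in memory at once
        for line in f:
            # partition() splits only at the first separator
            first, _, _ = line.partition(sep)
            values.add(first)
    return values
```

If you need a list afterwards, list(first_column_values(filename)) will give you one, though not in any particular order.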

Upvotes: 2
