Reputation: 17
So, I have a .txt file with values separated by semicolons. What I want to do is extract the first column and add the values to a list, without adding duplicates. What I came up with is:
values = []
with open(filename, 'r') as file:
    data = file.readlines()
    for line in data:
        tmpVal = line.split(';')[0]
        if tmpVal not in values:
            values.append(tmpVal)
The file is fairly big (~706 MB), and the script is running very slowly (it has been running for about 10 minutes now).
Can someone point out where I can improve my code?
Thanks a million, Jerome
Upvotes: 0
Views: 53
Reputation: 112
A possible improvement is to use a set instead of a list for values. This makes the if tmpVal not in values check unnecessary, since a set ignores duplicates; on a list that check is an O(n) operation (expensive!). Your code will be:
values = set()
with open(filename, 'r') as file:
    data = file.readlines()
    for line in data:
        tmpVal = line.split(';')[0]
        # no membership check needed: adding an existing value to a set is a no-op
        values.add(tmpVal)
And to make it more Pythonic:
with open(filename, 'r') as f:
    values = set(line.split(';')[0] for line in f.readlines())
Or, on newer versions of Python (2.7 and later), using a set comprehension:
with open(filename, 'r') as f:
    values = {line.split(';')[0] for line in f.readlines()}
Upvotes: 1
Reputation: 4462
Use set
values = set()
with open(filename, 'r') as file:
    for line in file:
        tmpVal = line.split(';')[0]
        values.add(tmpVal)
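One caveat, in case it matters: a set does not keep the order in which values first appear, while the list in the question did. Here is a minimal sketch that keeps that order and still uses a set for the membership check. The function name unique_first_column and the sep parameter are made up for illustration and are not from the question.

```python
def unique_first_column(lines, sep=';'):
    """Return first-column values without duplicates, in order of first appearance."""
    seen = set()    # set membership checks are O(1) on average
    ordered = []    # keeps the order in which values were first seen
    for line in lines:
        # split at most once: only the first field is needed
        value = line.split(sep, 1)[0]
        if value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered
```

It accepts any iterable of lines, so an open file can be passed in directly: with open(filename, 'r') as f: values = unique_first_column(f)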
Upvotes: 1
Reputation: 4422
Use a set instead of a list for values. Set membership checking will be a lot faster.
values = set()
Don't use readlines(), which loads the whole file into memory at once. Just iterate through the file itself.
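Putting both suggestions together could look roughly like this. It is only a sketch: the function name first_column_values and its parameters are assumptions for illustration, and it assumes the file can be opened in text mode with the default encoding.

```python
def first_column_values(path, sep=';'):
    """Collect the distinct values of the first column of a delimited text file."""
    values = set()
    with open(path, 'r') as f:
        # iterating over the file object reads one line at a time,
        # so the whole file is never held in memory
        for line in f:
            # split at most once: only the first field is needed
            values.add(line.split(sep, 1)[0])
    return values
```

Calling first_column_values(filename) then returns the set of distinct first-column values.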
Upvotes: 2