Reputation: 392
I'm trying to remove duplicate items from a large text file containing 250 million items (about 4.4 GB).
I was impressed to see that I could load this file into a Python list in just a few minutes with the following code:
x = []
with open("online.txt") as file:
    for l in file:
        x.append(l)
print('count of array: ')
print(len(x))
But when I tried to simply check that the next item doesn't already exist before adding it to the list, it's taking many hours to finish. I feel like I'm missing something simple that would really speed this up.
Here's the code I used to check for duplicate items:
a = []
x = []
with open("online.txt") as file:
    for l in file:
        if l in a:
            print('duplicate')
            print(l)
        else:
            x.append(l.strip())
            a.append(l)
print('with duplicates: ')
print(len(a))
print('without duplicates: ')
print(len(x))
This is running on a server with 64 GB of RAM and modern dual Xeon processors.
Upvotes: 1
Views: 70
Reputation: 392
In the end, here's the code I used to remove duplicates:
x = set()
with open("all.txt") as file:
    for l in file:
        x.add(l)
print('count of array: ')
print(len(x))
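If the deduplicated items also need to go back to disk, a minimal follow-up like this works (the output filename here is just an example, not from the original post):

with open("deduped.txt", "w") as out:
    # each item still ends with its newline from the original file,
    # so writelines() needs no extra separators; sorted() is optional
    out.writelines(sorted(x))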
Upvotes: 0
Reputation: 14906
The problem is that with a plain list, Python has to search through every existing entry each time before adding a new one.
You could try a Python dictionary or a set instead of a list. Both give roughly constant-time membership tests, so checking whether an entry already exists is much faster.
Simply change your code:
a = set()  # a set, not {} (which creates an empty dict)
x = set()
with open("online.txt") as file:
    for l in file:
        if l in a:
            print('duplicate')
            print(l)
        else:
            x.add(l.strip())  # add the stripped line to the result set
            a.add(l)
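To see the difference concretely, here is a rough, self-contained timing sketch (the sizes are made up for illustration, not taken from your data):

import timeit

items = [str(i) + '\n' for i in range(100000)]
as_list = list(items)
as_set = set(items)
needle = items[-1]  # worst case for the list: the element is at the very end

# a list membership test scans every element; a set test hashes the value
print(timeit.timeit(lambda: needle in as_list, number=1000))
print(timeit.timeit(lambda: needle in as_set, number=1000))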
You don't specify your input file format, but there may also be a speed increase from loading the whole data set into one giant string and splitting it with Python's string functions, rather than looping over it manually as you do here.
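For example, something like this (a rough sketch, assuming the file is newline-delimited and fits comfortably in the 64 GB of RAM you mention) does the splitting in C code instead of a Python loop:

with open("online.txt") as file:
    data = file.read()        # pull the whole 4.4 GB file into one string
lines = data.splitlines()     # split into items without a manual Python loop
unique = set(lines)           # deduplicate via hashing
print(len(unique))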
Upvotes: 1