stack flow

Reputation: 75

Remove duplicates from a large list, but remove both instances if a duplicate exists?

So I have a text file like this:

123
1234
123
1234
12345
123456

You can see 123 appears twice, so both instances should be removed, but 12345 appears only once, so it stays. My text file is about 70,000 lines.

Here is what I came up with.

file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1):  #if element count is not unique remove both elements
        lines.remove(appId)      #first instance removed
        lines.remove(appId)      #second instance removed


writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")

When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950 elements, yet I'm still getting 23,000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?

Edit: I FORGOT TO MENTION. An element can only appear twice MAX.

Upvotes: 2

Views: 672

Answers (3)

Hamidreza

Reputation: 1676

You can count all of the elements and store them in a dictionary:

dic = {a:lines.count(a) for a in lines}

Then remove all duplicated ones from the list:

for k in dic:
    if dic[k]>1:
        while k in lines:
            lines.remove(k)

NOTE: The while loop is needed because lines.remove(k) removes only the first occurrence of k from the list, so it must be repeated until there is no k value left in the list.

If the for loop feels too complicated, you can use the dictionary another way to get rid of duplicated values:

lines = [k for k, v in dic.items() if v==1]
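One caveat: the dictionary comprehension above calls lines.count(a) once per element, which is a full scan each time and therefore quadratic on a 70,000-line list. A sketch of the same idea built in a single pass with a plain dict (sample data borrowed from the question):

```python
# Single-pass counting: one dict lookup per line instead of one
# full list scan per line.
lines = ["123", "1234", "123", "1234", "12345", "123456"]

dic = {}
for a in lines:
    dic[a] = dic.get(a, 0) + 1   # increment this line's count

# Keep only the lines that appeared exactly once.
lines = [k for k, v in dic.items() if v == 1]
print(lines)  # ['12345', '123456']
```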

Upvotes: 0

Osman Mamun

Reputation: 2882

Use Counter from the built-in collections module:

In [1]: from collections import Counter

In [2]: a = [123, 1234, 123, 1234, 12345, 123456]

In [3]: a = Counter(a)

In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})


In [5]: a = [k for k, v in a.items() if v == 1]

In [6]: a
Out[6]: [12345, 123456]

For your particular problem I would do it like this:

from collections import defaultdict
out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1
with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1:  # here you use logic suitable for what you want
            f.write(k + '\n')

Upvotes: 4

Green Cloak Guy

Reputation: 24711

Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
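The skipping is easy to see on a small sample: when the loop removes two elements, everything after them shifts left, and the iterator's position moves past elements it never examined.

```python
# Same remove-while-iterating pattern as in the question,
# on a short sample list.
lines = ["a", "b", "c", "a", "b", "c"]
for appId in lines:
    if lines.count(appId) > 1:
        lines.remove(appId)  # removals shift later elements left...
        lines.remove(appId)  # ...so the iterator skips over them
print(lines)  # ['b', 'b'] -- the duplicated "b" is never visited and survives
```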

Instead, I suggest creating a filtered copy of the list using a list comprehension - instead of removing elements that appear more than twice, you would keep elements that appear less than that:

with open("test.txt", 'r') as file:
    lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) <= 2]  # if it appears twice or less

with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.writelines(line + "\n" for line in unique_lines)  # splitlines() stripped the newlines, so add them back

You could also easily modify this code to look for only one occurrence (if lines.count(line) == 1) or for more than two occurrences.
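Note that lines.count(line) is itself a full scan of the list, so the comprehension does one pass per line; on 70,000 lines that adds up. Counting once up front (e.g. with collections.Counter, as in another answer) keeps the one-occurrence filter linear. A sketch, using the sample data from the question:

```python
from collections import Counter

lines = ["123", "1234", "123", "1234", "12345", "123456"]
counts = Counter(lines)  # one counting pass over the whole list

# Keep only lines that occur exactly once, as the question requires.
unique_lines = [line for line in lines if counts[line] == 1]
print(unique_lines)  # ['12345', '123456']
```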

Upvotes: 2
