onlyf

Reputation: 883

Remove duplicates based on substring in Python

I have the following code for detecting duplicates in a file and writing them to three separate files: one for non-duplicates, one for lines that occur exactly twice (x2), and one for lines that occur more than twice (> x2). The first file holds only lines that had no duplicates in the original file. (It does not collapse duplicates down to one copy; it keeps only the lines that were single to begin with.)

import sys
import collections

file_in = sys.argv[1]
file_ot = file_in + ".proc"    # keys seen once
file_ot2 = file_in + ".proc2"  # keys seen exactly twice
file_ot3 = file_in + ".proc3"  # keys seen more than twice

counter = 0

# Group lines by the first 12 characters of the first ';'-separated field.
dict_in = collections.defaultdict(list)
with open(file_in, "r") as f:
    for line in f:
        counter += 1
        fixed_line = line.strip()
        line_list = fixed_line.split(";")
        key = line_list[0][:12]
        print(":Key: " + str(key))
        dict_in[key].append(line)

# Write each group to the file that matches its size.
with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    for values in dict_in.values():
        if len(values) == 1:
            f1.writelines(values)
        elif len(values) == 2:
            f2.writelines(values)
        else:
            f3.writelines(values)

print("Read: " + str(counter) + " lines")

The above code works, but for very large files (~1 GB) it takes about ten minutes to get through them on my system. Is there a way to speed this code up? Any suggestions in that direction are welcome. Thank you in advance!

Input data example:

0000AAAAAAAA;X;;X;
0000AAAAAAAA;X;X;;
0000BBBBBBBB;X;;;
0000CCCCCCCC;;X;;
0000DDDDDDDD;X;;X;
0000DDDDDDDD;X;X;;
0000DDDDDDDD;X;X;X;X
0000EEEEEEEE;X;X;X;X
0000FFFFFFFF;X;;;
0000GGGGGGGG;X;;X;
0000HHHHHHHH;X;X;;
0000JJJJJJJJ;X;X;;

Expected output:

FILE1:
0000BBBBBBBB;X;;;
0000CCCCCCCC;;X;;
0000EEEEEEEE;X;X;X;X
0000FFFFFFFF;X;;;
0000GGGGGGGG;X;;X;
0000HHHHHHHH;X;X;;
0000JJJJJJJJ;X;X;;

FILE2:
0000AAAAAAAA;X;;X;
0000AAAAAAAA;X;X;;

FILE3:
0000DDDDDDDD;X;;X;
0000DDDDDDDD;X;X;;
0000DDDDDDDD;X;X;X;X
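
For reference, the split above can be reproduced in memory on the sample data. This is only a sketch: the function name `split_by_key_count` is hypothetical, and it assumes the key is the first 12 characters of the first `;`-separated field, as in the code above.

```python
from collections import defaultdict


def split_by_key_count(lines, key_len=12):
    """Group lines by key and return (singles, pairs, more) lists.

    Hypothetical helper: the key is assumed to be the first `key_len`
    characters of the first ';'-separated field.
    """
    groups = defaultdict(list)
    for line in lines:
        key = line.split(";", 1)[0][:key_len]
        groups[key].append(line)

    singles, pairs, more = [], [], []
    for group in groups.values():
        if len(group) == 1:
            singles.extend(group)
        elif len(group) == 2:
            pairs.extend(group)
        else:
            more.extend(group)
    return singles, pairs, more


sample = [
    "0000AAAAAAAA;X;;X;",
    "0000AAAAAAAA;X;X;;",
    "0000BBBBBBBB;X;;;",
    "0000CCCCCCCC;;X;;",
    "0000DDDDDDDD;X;;X;",
    "0000DDDDDDDD;X;X;;",
    "0000DDDDDDDD;X;X;X;X",
    "0000EEEEEEEE;X;X;X;X",
    "0000FFFFFFFF;X;;;",
    "0000GGGGGGGG;X;;X;",
    "0000HHHHHHHH;X;X;;",
    "0000JJJJJJJJ;X;X;;",
]
singles, pairs, more = split_by_key_count(sample)
print(len(singles), len(pairs), len(more))  # 7 2 3
```

Here `singles`, `pairs` and `more` correspond to FILE1, FILE2 and FILE3 respectively.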

Upvotes: 1

Views: 469

Answers (1)

Gaming.ingrs

Reputation: 281

I tested this with a 543 MB file of random text.

import time

start = time.time()

# Read every line, dropping the trailing newline.
with open("myFile.txt") as f:
    myList = [line.rstrip("\n") for line in f]

with open("dupListaOne.txt", "w") as f1, open("dupListMore.txt", "w") as f2, open("UniqueList.txt", "w") as f3:
    # Note: list.count() rescans the whole list for every distinct line.
    for item in sorted(set(myList)):
        extra = myList.count(item) - 1  # number of additional copies
        if extra == 1:
            f1.write(f"{item} {extra}\n")
        elif extra > 1:
            f2.write(f"{item} {extra}\n")
        else:
            f3.write(f"{item} {extra}\n")

end = time.time()
print("Time: ", end - start)

# No explicit close() calls are needed: the with block closes the files.

Elapsed time: 123.83 seconds (about 2 minutes).
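
If memory or speed is the concern for ~1 GB inputs, one alternative is to count keys in a first pass and then stream each line to the right file in a second pass, so that only the key counts are held in memory. The following is a minimal sketch and has not been benchmarked. The names `key_of` and `split_file` are hypothetical, the output suffixes are borrowed from the question, and the key is assumed to be the first 12 characters of each line's first `;`-separated field. Unlike the question's code, lines are written in input order rather than grouped by key, which gives the same result when the input is already sorted by key.

```python
from collections import Counter


def key_of(line, key_len=12):
    # Same key as in the question: first 12 characters of the first field.
    return line.strip().split(";", 1)[0][:key_len]


def split_file(path):
    """Write path.proc, path.proc2 and path.proc3; return lines read."""
    # Pass 1: count how often each key occurs.
    with open(path) as f:
        counts = Counter(key_of(line) for line in f)

    # Pass 2: route every line by the count of its key.
    with open(path) as f, \
            open(path + ".proc", "w") as f1, \
            open(path + ".proc2", "w") as f2, \
            open(path + ".proc3", "w") as f3:
        total = 0
        for line in f:
            total += 1
            n = counts[key_of(line)]
            out = f1 if n == 1 else f2 if n == 2 else f3
            out.write(line)
    return total
```

It could be called as `split_file(sys.argv[1])`. Leaving out any per-line `print` also avoids console output for every line, which tends to be slow on large inputs.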

Upvotes: 2
