Reputation: 83
I have the following table:
index col_A col_B
0 John Jones
1 Perez Mark
2 Jones John
3 Mark Perez
4 Albert Perez
Rows 0 and 2 have the column values switched as well as rows 1 and 3. I am looking for a way to recognize rows 0 and 2 as duplicates and rows 1 and 3 as duplicates as well. Then I would like to drop the duplicates and keep only one of the unique occurrences. So the resulting matrix should be a 3 by 2 (since row 4 has no duplicates). Is there an efficient way to do this without using nested loops(my table has more than 3 millions rows)? Thanks
Upvotes: 0
Views: 109
Reputation: 5070
You could use a set
to remember already added elements. Something like
a = [(0, 'John', 'Jones'), (1, 'Perez', 'Mark'), (2, 'Jones', 'John'), (3, 'Mark', 'Perez'), (4, 'Albert', 'Perez')]
res = []
already_added = set()
for e in a:
key1, key2 = e[1]+e[2], e[2]+e[1]
if key1 not in already_added and key2 not in already_added:
res.append(e)
already_added.add(key1)
res
Out[37]: [(0, 'John', 'Jones'), (1, 'Perez', 'Mark'), (4, 'Albert', 'Perez')]
Upvotes: 1
Reputation: 104
A simple and most efficient way is to create a DBMS. In sqlite3, something like this could work:
DELETE FROM table WHERE id NOT IN (SELECT max(id) FROM table GROUP BY col_A,col_B)
Upvotes: 0