Jamie Lee
Jamie Lee

Reputation: 83

Python define duplicate rules

I have the following table:

index col_A col_B

0  John   Jones
1  Perez  Mark   
2  Jones  John   
3  Mark   Perez
4  Albert Perez

Rows 0 and 2 have the column values switched as well as rows 1 and 3. I am looking for a way to recognize rows 0 and 2 as duplicates and rows 1 and 3 as duplicates as well. Then I would like to drop the duplicates and keep only one of the unique occurrences. So the resulting matrix should be a 3 by 2 (since row 4 has no duplicates). Is there an efficient way to do this without using nested loops(my table has more than 3 millions rows)? Thanks

Upvotes: 0

Views: 109

Answers (2)

kvorobiev
kvorobiev

Reputation: 5070

You could use a set to remember already added elements. Something like

a = [(0, 'John', 'Jones'), (1, 'Perez', 'Mark'), (2, 'Jones', 'John'), (3, 'Mark', 'Perez'), (4, 'Albert', 'Perez')]

res = []

already_added = set()

for e in a:
    key1, key2 = e[1]+e[2], e[2]+e[1]
    if key1 not in already_added and key2 not in already_added:
        res.append(e)
        already_added.add(key1)

res
Out[37]: [(0, 'John', 'Jones'), (1, 'Perez', 'Mark'), (4, 'Albert', 'Perez')]

Upvotes: 1

Perseus784
Perseus784

Reputation: 104

A simple and most efficient way is to create a DBMS. In sqlite3, something like this could work:

DELETE FROM table WHERE id NOT IN (SELECT max(id) FROM table GROUP BY col_A,col_B)

Upvotes: 0

Related Questions