Reputation: 195
Many thanks for reading.
I have a pandas dataframe of roughly 200,000 rows and 46 columns. 23 of these columns end in "_1" and the other 23 end in "_2". For example:
forename_1 surname_1 area_1 forename_2 surname_2 area_2
george     neil      g      jim        bob       k
charlie    david     s      graham     josh      l
pete       keith     k      dan        joe       q
ben        steve     w      richard    ed        p
jim        bob       k      george     neil      g
dan        joe       q      pete       keith     k
I have successfully removed exact duplicates using drop_duplicates, but now want to remove rows that are duplicates of another row except that the two groups (1 and 2) have been swapped.
That is, for each row, I want to compare the combined values in forename_1, surname_1 and area_1 with the combined values in forename_2, surname_2 and area_2 of every other row.
I would want to keep the first of the two rows and remove the second 'duplicate' (i.e. keep='first').
To help explain, there are two cases above where a duplicate would need to be removed:
george     neil      g      jim        bob       k
jim        bob       k      george     neil      g

pete       keith     k      dan        joe       q
dan        joe       q      pete       keith     k
In each case, the second row of the two would be removed, meaning my expected output would be:
forename_1 surname_1 area_1 forename_2 surname_2 area_2
george     neil      g      jim        bob       k
charlie    david     s      graham     josh      l
pete       keith     k      dan        joe       q
ben        steve     w      richard    ed        p
I have seen answers that deal with this in R (linked below), but is there also a way this can be done in Python?
Compare group of two columns and return index matches R
Remove duplicates where values are swapped across 2 columns in R
Many thanks.
Upvotes: 5
Views: 1271
Reputation: 2553
There may be a better solution, but here's one that splits the dataframe in two, stacks and deduplicates the halves, and then reverses the operation to get back to the original format:
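For reference, here is a minimal sketch (not shown in the transcript below) that rebuilds the sample dataframe from the values in the question, so the steps can be reproduced:

import pandas as pd

# Reconstruction of the question's example data
df = pd.DataFrame({
    'forename_1': ['george', 'charlie', 'pete', 'ben', 'jim', 'dan'],
    'surname_1':  ['neil', 'david', 'keith', 'steve', 'bob', 'joe'],
    'area_1':     ['g', 's', 'k', 'w', 'k', 'q'],
    'forename_2': ['jim', 'graham', 'dan', 'richard', 'george', 'pete'],
    'surname_2':  ['bob', 'josh', 'joe', 'ed', 'neil', 'keith'],
    'area_2':     ['k', 'l', 'q', 'p', 'g', 'k'],
})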
In [43]: df
Out[43]:
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0 george     neil      g      jim        bob       k
1 charlie    david     s      graham     josh      l
2 pete       keith     k      dan        joe       q
3 ben        steve     w      richard    ed        p
4 jim        bob       k      george     neil      g
5 dan        joe       q      pete       keith     k
Let's label the rows, so we can merge them back together properly later on:
In [57]: df['index'] = df.index
Now we split the dataframe into its two groups of columns and rename them (taking copies so that we can add a "source" column without a SettingWithCopyWarning):
In [59]: df_1 = df[['forename_1', 'surname_1', 'area_1', 'index']].copy()
In [60]: df_2 = df[['forename_2', 'surname_2', 'area_2', 'index']].copy()
In [61]: df_1.columns = ['forename', 'surname', 'area', 'index']
In [62]: df_2.columns = ['forename', 'surname', 'area', 'index']
In [63]: df_1['source'] = 1
In [64]: df_2['source'] = 2
Let's concatenate the two halves and drop the duplicates (thanks to sorting on "index", keep='first' keeps the record from the earliest original row):
In [67]: df = pd.concat([df_1, df_2])
In [68]: df
Out[68]:
  forename surname area index source
0 george   neil    g    0     1
1 charlie  david   s    1     1
2 pete     keith   k    2     1
3 ben      steve   w    3     1
4 jim      bob     k    4     1
5 dan      joe     q    5     1
0 jim      bob     k    0     2
1 graham   josh    l    1     2
2 dan      joe     q    2     2
3 richard  ed      p    3     2
4 george   neil    g    4     2
5 pete     keith   k    5     2
In [71]: out = df.sort_values(['index']).drop_duplicates(['forename', 'surname', 'area'], keep='first')
In [72]: out
Out[72]:
  forename surname area index source
0 george   neil    g    0     1
0 jim      bob     k    0     2
1 charlie  david   s    1     1
1 graham   josh    l    1     2
2 pete     keith   k    2     1
2 dan      joe     q    2     2
3 ben      steve   w    3     1
3 richard  ed      p    3     2
Lookin' good, the unwanted rows went away! Now we merge everything back together (depending on your use case you may need a different type of join; see the merge documentation for that):
In [76]: df_1_out = out[out['source'] == 1][['forename', 'surname', 'area', 'index']]
In [77]: df_2_out = out[out['source'] == 2][['forename', 'surname', 'area', 'index']]
In [82]: df_1_out.merge(df_2_out, on='index', suffixes=('_1', '_2')).drop('index', axis=1)
Out[82]:
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0 george     neil      g      jim        bob       k
1 charlie    david     s      graham     josh      l
2 pete       keith     k      dan        joe       q
3 ben        steve     w      richard    ed        p
Which is the expected result!
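Two follow-up sketches, neither of which is part of the walkthrough above. First, the merge uses the default inner join, so if deduplication ever removed only one half of a row in your real data, that whole row would vanish from the result; if you would rather keep such rows, an outer join is a possible variant:

df_1_out.merge(df_2_out, on='index', how='outer', suffixes=('_1', '_2')).drop('index', axis=1)

Second, since the question only asks to drop whole rows whose two groups match another row with the groups swapped, a more compact sketch (assuming df and pd as above; 'keys' and 'deduped' are just illustrative names) is to build an order-independent key per row and use duplicated():

cols_1 = ['forename_1', 'surname_1', 'area_1']
cols_2 = ['forename_2', 'surname_2', 'area_2']

# Key = the row's two (forename, surname, area) triples, sorted so that
# swapping group 1 and group 2 produces the same key.
keys = pd.Series(
    [tuple(sorted([p1, p2])) for p1, p2 in zip(
        df[cols_1].itertuples(index=False, name=None),
        df[cols_2].itertuples(index=False, name=None))],
    index=df.index)

# Keep the first occurrence of each key, dropping the swapped repeats
deduped = df[~keys.duplicated(keep='first')]

Note that this only removes rows whose full pair of groups duplicates another row's pair, which matches the expected output in the question.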
Upvotes: 3