Joseph0210
Joseph0210

Reputation: 195

Remove duplicate rows where values swapped across columns

Many thanks for reading.

I have a pandas dataframe of roughly 200,000 rows and 46 columns. 23 of these columns end in "_1" and the other 23 end in "_2". For example:

 forename_1   surname_1   area_1   forename_2   surname_2   area_2
    george       neil       g         jim         bob        k
    charlie      david      s         graham      josh       l
    pete         keith      k         dan         joe        q
    ben          steve      w         richard     ed         p
    jim          bob        k         george      neil       g
    dan          joe        q         pete        keith      k

I have successfully removed duplicates using drop_duplicates, but now want to remove rows that are duplicates but the group they are in (1 or 2) has been inverted.

That is, for one row, I want to compare the combined values in forename_1, surname_1 and area_1 with the combined values in forename_2, surname_2 and area_2 for all other rows.

I would want to remove the second 'duplicate' out of the two (e.g. keep='first').

To help explain, there are two cases above where a duplicate would need to removed:

george       neil       g         jim         bob        k
jim          bob        k         george      neil       g

pete         keith      k         dan         joe        q
dan          joe        q         pete        keith      k

In each case, the second row of the two would be removed, meaning my expected output would be:

  forename_1   surname_1   area_1   forename_2   surname_2   area_2
    george       neil       g         jim         bob        k
    charlie      david      s         graham      josh       l
    pete         keith      k         dan         joe        q
    ben          steve      w         richard     ed         p

I have seen an answer that deals with this in R, but is there also a way that this can be done in Python?

Compare group of two columns and return index matches R

Remove duplicates where values are swapped across 2 columns in R

Many thanks.

Upvotes: 5

Views: 1271

Answers (1)

3kt
3kt

Reputation: 2553

There may be a better solution, but here's one splitting and recombining the dataframe to remove duplicates, and then doing the opposite operation leading back to the original format:

In [43]: df
Out[43]: 
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0     george      neil      g        jim       bob      k
1    charlie     david      s     graham      josh      l
2       pete     keith      k        dan       joe      q
3        ben     steve      w    richard        ed      p
4        jim       bob      k     george      neil      g
5        dan       joe      q       pete     keith      k

Let's label the rows, so we can merge them back together properly later on:

In [57]: df['index'] = df.index

Now we split the dataframe, and rename the columns:

In [59]: df_1 = df[['forename_1', 'surname_1', 'area_1', 'index']]

In [60]: df_2 = df[['forename_2', 'surname_2', 'area_2', 'index']]

In [61]: df_1.columns = ['forename', 'surname', 'area', 'index']

In [62]: df_2.columns = ['forename', 'surname', 'area', 'index']

In [63]: df_1['source'] = 1

In [64]: df_2['source'] = 2

Let's merge the datasets, and drop duplicates (thanks to the "index" sorting, we keep the first value).

In [67]: df = pd.concat([df_1, df_2])

In [68]: df
Out[68]: 
  forename surname area  index  source
0   george    neil    g      0       1
1  charlie   david    s      1       1
2     pete   keith    k      2       1
3      ben   steve    w      3       1
4      jim     bob    k      4       1
5      dan     joe    q      5       1
0      jim     bob    k      0       2
1   graham    josh    l      1       2
2      dan     joe    q      2       2
3  richard      ed    p      3       2
4   george    neil    g      4       2
5     pete   keith    k      5       2

In [71]: out = df.sort_values(['index']).drop_duplicates(['forename', 'surname', 'area'], keep='first')

In [72]: out  
Out[72]: 
  forename surname area  index  source
0   george    neil    g      0       1
0      jim     bob    k      0       2
1  charlie   david    s      1       1
1   graham    josh    l      1       2
2     pete   keith    k      2       1
2      dan     joe    q      2       2
3      ben   steve    w      3       1
3  richard      ed    p      3       2

Lookin' good, the unwanted rows went away! Now we merge everything back together (depending on your use case, you may need to use a different type of join, please refer to the merge documentation for that):

In [76]: df_1_out = out[out['source'] == 1][['forename', 'surname', 'area', 'index']]

In [77]: df_2_out = out[out['source'] == 2][['forename', 'surname', 'area', 'index']]

In [82]: df_1_out.merge(df_2_out, on='index', suffixes=('_1', '_2')).drop('index', axis=1)
Out[82]: 
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0     george      neil      g        jim       bob      k
1    charlie     david      s     graham      josh      l
2       pete     keith      k        dan       joe      q
3        ben     steve      w    richard        ed      p

Which is the expected result!

Upvotes: 3

Related Questions