Reputation: 195
Many thanks for reading.
I have a pandas dataframe of roughly 200,000 rows and 46 columns. 23 of these columns end in "_1" and the other 23 end in "_2". For example:
forename_1 surname_1 area_1 forename_2 surname_2 area_2
george     neil      g      jim        bob       k
charlie    david     s      graham     josh      l
pete       keith     k      dan        joe       q
ben        steve     w      richard    ed        p
jim        bob       k      george     neil      g
dan        joe       q      pete       keith     k
I have successfully removed exact duplicates using drop_duplicates, but now want to remove rows that are duplicates of another row except that the two groups (1 and 2) have been swapped.
That is, for each row, I want to compare the combined values in forename_1, surname_1 and area_1 with the combined values in forename_2, surname_2 and area_2 of every other row.
I would want to keep the first of the two rows and remove the second 'duplicate' (i.e. keep='first').
To help explain, there are two cases above where a duplicate would need to be removed:
george     neil      g      jim        bob       k
jim        bob       k      george     neil      g

pete       keith     k      dan        joe       q
dan        joe       q      pete       keith     k
In each case, the second row of the two would be removed, meaning my expected output would be:
forename_1 surname_1 area_1 forename_2 surname_2 area_2
george     neil      g      jim        bob       k
charlie    david     s      graham     josh      l
pete       keith     k      dan        joe       q
ben        steve     w      richard    ed        p
I have seen answers that deal with this in R (linked below), but is there also a way this can be done in Python?
Compare group of two columns and return index matches R
Remove duplicates where values are swapped across 2 columns in R
Many thanks.
Upvotes: 5
Views: 1271
Reputation: 2553
There may be a better solution, but here's one that splits the dataframe in two, stacks and deduplicates the halves, and then reverses the operation to get back to the original format:
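For reference, here is a minimal sketch (not shown in the transcript below) that rebuilds the sample dataframe from the values in the question, so the steps can be reproduced:

import pandas as pd

# Reconstruction of the question's example data
df = pd.DataFrame({
    'forename_1': ['george', 'charlie', 'pete', 'ben', 'jim', 'dan'],
    'surname_1':  ['neil', 'david', 'keith', 'steve', 'bob', 'joe'],
    'area_1':     ['g', 's', 'k', 'w', 'k', 'q'],
    'forename_2': ['jim', 'graham', 'dan', 'richard', 'george', 'pete'],
    'surname_2':  ['bob', 'josh', 'joe', 'ed', 'neil', 'keith'],
    'area_2':     ['k', 'l', 'q', 'p', 'g', 'k'],
})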
In [43]: df
Out[43]:
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0 george     neil      g      jim        bob       k
1 charlie    david     s      graham     josh      l
2 pete       keith     k      dan        joe       q
3 ben        steve     w      richard    ed        p
4 jim        bob       k      george     neil      g
5 dan        joe       q      pete       keith     k
Let's label the rows, so we can merge them back together properly later on:
In [57]: df['index'] = df.index
Now we split the dataframe into its two groups of columns and rename them (taking copies so that we can add a "source" column without a SettingWithCopyWarning):
In [59]: df_1 = df[['forename_1', 'surname_1', 'area_1', 'index']].copy()
In [60]: df_2 = df[['forename_2', 'surname_2', 'area_2', 'index']].copy()
In [61]: df_1.columns = ['forename', 'surname', 'area', 'index']
In [62]: df_2.columns = ['forename', 'surname', 'area', 'index']
In [63]: df_1['source'] = 1
In [64]: df_2['source'] = 2
Let's concatenate the two halves and drop the duplicates (thanks to sorting on "index", keep='first' keeps the record from the earliest original row):
In [67]: df = pd.concat([df_1, df_2])
In [68]: df
Out[68]:
  forename surname area index source
0 george   neil    g    0     1
1 charlie  david   s    1     1
2 pete     keith   k    2     1
3 ben      steve   w    3     1
4 jim      bob     k    4     1
5 dan      joe     q    5     1
0 jim      bob     k    0     2
1 graham   josh    l    1     2
2 dan      joe     q    2     2
3 richard  ed      p    3     2
4 george   neil    g    4     2
5 pete     keith   k    5     2
In [71]: out = df.sort_values(['index']).drop_duplicates(['forename', 'surname', 'area'], keep='first')
In [72]: out
Out[72]:
  forename surname area index source
0 george   neil    g    0     1
0 jim      bob     k    0     2
1 charlie  david   s    1     1
1 graham   josh    l    1     2
2 pete     keith   k    2     1
2 dan      joe     q    2     2
3 ben      steve   w    3     1
3 richard  ed      p    3     2
Lookin' good, the unwanted rows went away! Now we merge everything back together (depending on your use case you may need a different type of join; see the merge documentation for that):
In [76]: df_1_out = out[out['source'] == 1][['forename', 'surname', 'area', 'index']]
In [77]: df_2_out = out[out['source'] == 2][['forename', 'surname', 'area', 'index']]
In [82]: df_1_out.merge(df_2_out, on='index', suffixes=('_1', '_2')).drop('index', axis=1)
Out[82]:
  forename_1 surname_1 area_1 forename_2 surname_2 area_2
0 george     neil      g      jim        bob       k
1 charlie    david     s      graham     josh      l
2 pete       keith     k      dan        joe       q
3 ben        steve     w      richard    ed        p
Which is the expected result!
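Two follow-up sketches, neither of which is part of the walkthrough above. First, the merge uses the default inner join, so if deduplication ever removed only one half of a row in your real data, that whole row would vanish from the result; if you would rather keep such rows, an outer join is a possible variant:

df_1_out.merge(df_2_out, on='index', how='outer', suffixes=('_1', '_2')).drop('index', axis=1)

Second, since the question only asks to drop whole rows whose two groups match another row with the groups swapped, a more compact sketch (assuming df and pd as above; 'keys' and 'deduped' are just illustrative names) is to build an order-independent key per row and use duplicated():

cols_1 = ['forename_1', 'surname_1', 'area_1']
cols_2 = ['forename_2', 'surname_2', 'area_2']

# Key = the row's two (forename, surname, area) triples, sorted so that
# swapping group 1 and group 2 produces the same key.
keys = pd.Series(
    [tuple(sorted([p1, p2])) for p1, p2 in zip(
        df[cols_1].itertuples(index=False, name=None),
        df[cols_2].itertuples(index=False, name=None))],
    index=df.index)

# Keep the first occurrence of each key, dropping the swapped repeats
deduped = df[~keys.duplicated(keep='first')]

Note that this only removes rows whose full pair of groups duplicates another row's pair, which matches the expected output in the question.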
Upvotes: 3