Reputation: 119
I am trying to sort through a pandas dataframe and find duplicates.
However, I am not just trying to locate duplicates and get rid of them. I need to see exactly which two (or more) file numbers contain the same EIN, and move those over to a new DataFrame.
For example, if file_num 376 and 7212 contain the exact same EIN (12370123723), I'd like to create a DataFrame that looks something like this:
EIN          file_num
12370123723  376, 7212
If anyone has suggestions for how to do this, any feedback would be appreciated. I tried using the .duplicated() method, but it only returns booleans and doesn't tell me exactly which files are duplicates of which.
Upvotes: 0
Views: 895
Reputation: 12503
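Select every row whose EIN occurs more than once (keep=False marks all occurrences, not just the repeats), then group by EIN and collect the file numbers into lists: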
dups = df[df.EIN.duplicated(keep=False)]
dups.groupby("EIN")["file_num"].apply(list)
These are the results for synthetic data:
Data:
   EIN  file_num
0    2         0
1    5         1
2    0         2
3    5         3
4    5         4
5    5         5
6    6         6
7    0         7
8    2         8
9    3         9
Output:
EIN
0          [2, 7]
2          [0, 8]
5    [1, 3, 4, 5]
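If you want the result as a regular DataFrame in the shape shown in the question (one row per EIN, file numbers as a comma-separated string), something along these lines should work. This is only a sketch on the same synthetic data; the names df, dups and result are placeholders, and joining the numbers into a string is my assumption about the desired format rather than anything the question requires.

```python
import pandas as pd

# Synthetic data matching the table above (not real EINs).
df = pd.DataFrame({
    "EIN": [2, 5, 0, 5, 5, 5, 6, 0, 2, 3],
    "file_num": list(range(10)),
})

# keep=False flags every row whose EIN occurs more than once,
# including the first occurrence.
dups = df[df["EIN"].duplicated(keep=False)]

# One row per duplicated EIN; its file numbers are joined into a single string.
result = (
    dups.groupby("EIN")["file_num"]
    .agg(lambda s: ", ".join(str(n) for n in s))
    .reset_index()
)
print(result)
```

Here result has the columns EIN and file_num, and EIN 5 maps to the string "1, 3, 4, 5". If you would rather keep the file numbers as Python lists, swap the lambda for list.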
Upvotes: 1