Matthew Kaplan

Reputation: 119

Moving duplicates in a Pandas dataframe to a new dataframe

I am trying to sort through a pandas dataframe and find duplicates.

However, I am not just trying to locate duplicates and drop them. I need to see exactly which two (or more) file numbers share the same EIN, and move those over to a new dataframe.

For example, if file_num 376 and 7212 have exactly the same EIN (12370123723), I'd like to create a dataframe that looks something like this:

EIN             file_num
12370123723     376, 7212

Any suggestions on how to do this would be appreciated. I tried the .duplicated() method, but it only returns booleans and doesn't tell me which files are duplicates of which.

Upvotes: 0

Views: 895

Answers (1)

Roy2012

Reputation: 12503

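Filter the rows whose EIN occurs more than once, then group them by EIN and collect the file numbers:

# keep=False flags every occurrence of a repeated EIN, not just the later ones
dups = df[df.EIN.duplicated(keep=False)]
# one list of file numbers per duplicated EIN
dups.groupby("EIN")["file_num"].apply(list)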

These are the results for synthetic data:

Data:

   EIN  file_num
0    2         0
1    5         1
2    0         2
3    5         3
4    5         4
5    5         5
6    6         6
7    0         7
8    2         8
9    3         9

Output:

EIN
0          [2, 7]
2          [0, 8]
5    [1, 3, 4, 5]
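
If you want an actual dataframe with the file numbers joined into one comma-separated string, as in the layout sketched in the question, here is a minimal self-contained sketch. It rebuilds the synthetic data above; the `result` name and the `", "` separator are just illustrative choices, not anything required by pandas.

```python
import pandas as pd

# Synthetic data matching the table above
df = pd.DataFrame({
    "EIN": [2, 5, 0, 5, 5, 5, 6, 0, 2, 3],
    "file_num": list(range(10)),
})

# Keep every row whose EIN appears more than once
dups = df[df["EIN"].duplicated(keep=False)]

# Join the file numbers of each EIN into one string, then turn the
# grouped Series back into a two-column dataframe
result = (
    dups.groupby("EIN")["file_num"]
    .apply(lambda s: ", ".join(map(str, s)))
    .reset_index()
)
print(result)
```

On this data, `result` has one row per duplicated EIN (0, 2 and 5), with `file_num` holding "2, 7", "0, 8" and "1, 3, 4, 5" respectively. If you would rather keep real lists than strings, the `.apply(list)` version above followed by `.reset_index()` should give the same shape.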

Upvotes: 1
