Moving duplicates in a Pandas dataframe to a new dataframe

Question

I am trying to sort through a pandas dataframe and find duplicates.

However, I am not just trying to locate duplicates and get rid of them. I need to see exactly which two (or more) file numbers contain the same EIN, and move those over to a new dataframe.

For example, if file_num 376 and 7212 contain the exact same EIN (12370123723), I'd like to create a dataframe that looks something like this:

EIN:            file_num
12370123723     376, 7212

If anyone has suggestions on how to do this, any feedback would be appreciated. I tried using the .duplicated() method, but it only returns booleans and doesn't tell me exactly which files are duplicates of which.

Roy2012 · Accepted Answer

Do the following:

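# keep=False marks every occurrence of a repeated EIN, not just the later ones
dups = df[df.EIN.duplicated(keep=False)]
# collect the file numbers that share each EIN into a list
dups.groupby("EIN")["file_num"].apply(list)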

These are the results for synthetic data:

Data:

   EIN  file_num
0    2         0
1    5         1
2    0         2
3    5         3
4    5         4
5    5         5
6    6         6
7    0         7
8    2         8
9    3         9

Output:

EIN
0          [2, 7]
2          [0, 8]
5    [1, 3, 4, 5]
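If you want the result as a dataframe in the shape shown in the question (one row per EIN, with the file numbers as a comma-separated string), something like the following sketch might work. The sample values below are made up for illustration, and it assumes the columns are named EIN and file_num as in the question.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "EIN": [12370123723, 555, 12370123723, 777, 555, 999],
    "file_num": [376, 10, 7212, 11, 12, 13],
})

# Keep every row whose EIN occurs more than once.
dups = df[df["EIN"].duplicated(keep=False)]

# One row per duplicated EIN, with its file numbers joined into one string.
result = (
    dups.groupby("EIN")["file_num"]
    .agg(lambda s: ", ".join(str(n) for n in s))
    .reset_index()
)
print(result)
```

Joining into a string is convenient for display; if you need to process the file numbers further, keeping them as lists (as above) is probably easier to work with.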
