artemis

Reputation: 7281

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.

|   id   | ft1 | ft2 | ft3 | ft4 | ft5 |  label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 |  1  |  1  |  43 |  1  |  1  |  High  |
| id_101 |  1  |  1  |  33 |  0  |  1  | Medium |
| id_102 |  1  |  1  |  12 |  1  |  1  |   Low  |
| id_103 |  1  |  1  |  46 |  1  |  0  |   Low  |
| id_104 |  1  |  1  |  10 |  1  |  1  |  High  |
| id_105 |  0  |  1  |  99 |  0  |  1  |   Low  |
| id_106 |  0  |  0  |  0  |  0  |  0  |  High  |
| id_107 |  1  |  1  |  6  |  0  |  1  |  High  |
| id_108 |  1  |  1  |  29 |  1  |  1  | Medium |
| id_109 |  1  |  0  |  27 |  0  |  0  | Medium |
| id_110 |  0  |  1  |  32 |  0  |  1  |  High  |

What I am trying to accomplish is to look at a subset of the features and, if there are duplicate rows, keep the first and then record which id: label pairs are its duplicates.

I have looked at related posts and know pandas has a duplicated() call, so I tried implementing that and it sort of works:

import pandas as pd

# Read in example data
df = pd.read_clipboard()

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]

# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()

# Loop through the duplicates and print out the values I want
for idx in dupes:
#    print(df[:idx])
    print(df.loc[[idx],['id', 'label']])

However, what I am trying to do is, for a particular row, determine which rows are duplicates of it and save them as id: label combinations. So while I am able to extract the id and label for each duplicate, I have no way to map it back to the original row that it duplicates.
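
To illustrate the kind of mapping I am after, here is a rough sketch (continuing from the code above; the names first_id, dupes_mask and dupe_to_original are just mine for illustration) that pairs every duplicate with the id of its first occurrence:

# id of the first occurrence within each group of identical feature values
first_id = df.groupby(cols, sort=False)['id'].transform('first')

# rows flagged as duplicates, mapped to the id of the row they duplicate
dupes_mask = df.duplicated(subset=cols, keep='first')
dupe_to_original = dict(zip(df.loc[dupes_mask, 'id'], first_id[dupes_mask]))
# e.g. {'id_102': 'id_100', 'id_104': 'id_100', 'id_108': 'id_100',
#       'id_107': 'id_101', 'id_110': 'id_105'}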

An ideal dataset would look like:

|   id   | ft1 | ft2 | ft3 | ft4 | ft5 |  label |                  duplicates                 |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 |  1  |  1  |  43 |  1  |  1  |  High  | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 |  1  |  1  |  33 |  0  |  1  | Medium |                {id_107: High}               |
| id_102 |  1  |  1  |  12 |  1  |  1  |   Low  |                                             |
| id_103 |  1  |  1  |  46 |  1  |  0  |   Low  |                                             |
| id_104 |  1  |  1  |  10 |  1  |  1  |  High  |                                             |
| id_105 |  0  |  1  |  99 |  0  |  1  |   Low  |                {id_110: High}               |
| id_106 |  0  |  0  |  0  |  0  |  0  |  High  |                                             |
| id_107 |  1  |  1  |  6  |  0  |  1  |  High  |                                             |
| id_108 |  1  |  1  |  29 |  1  |  1  | Medium |                                             |
| id_109 |  1  |  0  |  27 |  0  |  0  | Medium |                                             |
| id_110 |  0  |  1  |  32 |  0  |  1  |  High  |                                             |

How can I take my duplicated values and map them back to their originals efficiently (given the size of my actual dataset)?

Upvotes: 2

Views: 564

Answers (1)

jezrael

Reputation: 863166

Working with dictionaries in columns is really complicated; here is one possible solution:

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]

# mask of duplicate rows (every occurrence after the first)
m = sub_df.duplicated()
# build (id, label) tuples for the duplicate rows, then aggregate them into one dict per group
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))

# join the aggregated dicts back as a new column, matching on the feature columns
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')

print(df)

        id  ft1  ft2  ft3  ft4  ft5   label  \
0   id_100    1    1   43    1    1    High   
1   id_101    1    1   33    0    1  Medium   
2   id_102    1    1   12    1    1     Low   
3   id_103    1    1   46    1    0     Low   
4   id_104    1    1   10    1    1    High   
5   id_105    0    1   99    0    1     Low   
6   id_106    0    0    0    0    0    High   
7   id_107    1    1    6    0    1    High   
8   id_108    1    1   29    1    1  Medium   
9   id_109    1    0   27    0    0  Medium   
10  id_110    0    1   32    0    1    High   

                                           duplicates  
0   {'id_102': 'Low', 'id_104': 'High', 'id_108': ...  
1                                  {'id_107': 'High'}  
2                                                      
3                                                      
4                                                      
5                                  {'id_110': 'High'}  
6                                                      
7                                                      
8                                                      
9                                                      
10                   
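
If only the rows that actually collected duplicates are needed afterwards, they can be filtered out (a small usage sketch, assuming the code above has been run):

# keep only the original rows whose 'duplicates' cell is a non-empty dict
originals = df[df['duplicates'] != '']
print(originals[['id', 'label', 'duplicates']])

# the dictionary stored in a cell behaves like any Python dict
for dup_id, dup_label in df.loc[0, 'duplicates'].items():
    print(dup_id, dup_label)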

An alternative is a custom function that assigns all duplicates except the first one to the first row of each group in a new column; at the end a different mask is used to blank out the remaining values with empty strings:

# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']

# mask of the first row of every group that has at least one duplicate
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)

def f(x):
    # collect the {id: label} pairs of all rows after the first and store them in the
    # first row of the group; the dict is wrapped in a list so it fits into a single cell
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x

df = df.groupby(cols).apply(f)
# keep the dictionaries only on the first rows of duplicated groups, blank out the rest
df['duplicated'] = df['duplicated'].where(m, '')

print(df)
        id  ft1  ft2  ft3  ft4  ft5   label  \
0   id_100    1    1   43    1    1    High   
1   id_101    1    1   33    0    1  Medium   
2   id_102    1    1   12    1    1     Low   
3   id_103    1    1   46    1    0     Low   
4   id_104    1    1   10    1    1    High   
5   id_105    0    1   99    0    1     Low   
6   id_106    0    0    0    0    0    High   
7   id_107    1    1    6    0    1    High   
8   id_108    1    1   29    1    1  Medium   
9   id_109    1    0   27    0    0  Medium   
10  id_110    0    1   32    0    1    High   

                                           duplicated  
0   {'id_102': 'Low', 'id_104': 'High', 'id_108': ...  
1                                  {'id_107': 'High'}  
2                                                      
3                                                      
4                                                      
5                                  {'id_110': 'High'}  
6                                                      
7                                                      
8                                                      
9                                                      
10                                                     
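
This second approach names the new column duplicated rather than duplicates; if the name from the question is wanted, it can be renamed afterwards (a small sketch, assuming the code above has been run):

# rename the result column to match the expected output from the question
df = df.rename(columns={'duplicated': 'duplicates'})

Note that with roughly 438k rows, groupby().apply() has to rebuild every group DataFrame, so the first solution is usually the faster one.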

Upvotes: 2
