sheel

Reputation: 509

How to make clusters of pandas data frame according to given dictionary?

I have a pandas DataFrame and want to group its rows into clusters according to the dictionary below. For example, the dictionary lists 7 values for the 'Info 1' cluster, but only 4 of them appear in the DataFrame, so that cluster should contain just those 4. The expected result is shown under OUTPUT.

INPUT:

PII               Counts 
CREDIT_CARD        158
DATE_TIME          544
DOMAIN_NAME        609
EMAIL_ADDRESS      90
IP_ADDRESS         405
LOCATION           346
PERSON             202
BANK_NUMBER        202
PASSPORT           130
NHS                6
NRP                20

dict = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
                   'DRIVER_LICENSE'],
        'Info 2': ['NHS'],
        'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
        'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
        'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}

OUTPUT :

    Names            Count             Info
0   Info 5           [158, 202]        ['CREDIT_CARD','BANK_NUMBER']
1   Info 2                  [6]        ['NHS']
2   Info 3           [405, 609]        ['IP_ADDRESS','DOMAIN_NAME']
3   Info 4            [20, 544]        ['NRP','DATE_TIME']
4   Info 1  [202, 346, 90, 130]        ['PERSON','LOCATION','EMAIL_ADDRESS','PASSPORT']

Upvotes: 1

Views: 321

Answers (1)

jezrael

Reputation: 862581

First, don't use dict as a variable name, because it shadows the Python builtin.

Then invert the dictionary so that each list item becomes a key pointing to its cluster name, map the PII column through it with Series.map, and pass the result to DataFrame.groupby, aggregating with list:

d = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
                'DRIVER_LICENSE'],
     'Info 2': ['NHS'],
     'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
     'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}


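# Invert d: each PII label becomes a key pointing to its cluster name
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

# Group by the mapped cluster name (sort=False keeps first-appearance order)
# and collect the remaining columns into lists
df1 = (df.groupby(df['PII'].map(d1).rename('Names'), sort=False)
         .agg(list)
         .reset_index())
print(df1)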
    Names                                          PII               Counts
0  Info 5                   [CREDIT_CARD, BANK_NUMBER]           [158, 202]
1  Info 4                             [DATE_TIME, NRP]            [544, 20]
2  Info 3                    [DOMAIN_NAME, IP_ADDRESS]           [609, 405]
3  Info 1  [EMAIL_ADDRESS, LOCATION, PERSON, PASSPORT]  [90, 346, 202, 130]
4  Info 2                                        [NHS]                  [6]
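
If you want to try this end to end, here is a minimal self-contained sketch. It rebuilds the DataFrame from the INPUT table in the question, and it uses named aggregation to produce the Count and Info column names from the desired OUTPUT. That renaming step and the variable name out are my additions for illustration, not part of the snippet above.

```python
import pandas as pd

# Sample data copied from the question's INPUT table
df = pd.DataFrame({
    'PII': ['CREDIT_CARD', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS',
            'IP_ADDRESS', 'LOCATION', 'PERSON', 'BANK_NUMBER',
            'PASSPORT', 'NHS', 'NRP'],
    'Counts': [158, 544, 609, 90, 405, 346, 202, 202, 130, 6, 20],
})

d = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS',
                'PASSPORT', 'SSN', 'DRIVER_LICENSE'],
     'Info 2': ['NHS'],
     'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
     'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}

# Invert the mapping: each PII label points to its cluster name
d1 = {label: name for name, labels in d.items() for label in labels}

# Named aggregation picks the output column names (Count, Info);
# 'out' is a hypothetical name used only in this sketch
out = (df.groupby(df['PII'].map(d1).rename('Names'), sort=False)
         .agg(Count=('Counts', list), Info=('PII', list))
         .reset_index())
print(out)
```

Labels that exist only in the dictionary (PHONE_NUMBER, SSN, ...) simply never match a row. A PII value that is missing from the dictionary maps to NaN, and groupby drops NaN keys by default, so such rows would silently disappear from the result.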

Upvotes: 2
