Reputation: 509
I have a pandas DataFrame and I am trying to group its rows into clusters according to the dict below. For example, the 'Info 1' cluster lists 7 values in the dictionary, but only 4 of them appear in the DataFrame, so the cluster should contain only those 4. Clustering this way should give the output below.
INPUT:
PII Counts
CREDIT_CARD 158
DATE_TIME 544
DOMAIN_NAME 609
EMAIL_ADDRESS 90
IP_ADDRESS 405
LOCATION 346
PERSON 202
BANK_NUMBER 202
PASSPORT 130
NHS 6
NRP 20
dict = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
                   'DRIVER_LICENSE'],
        'Info 2': ['NHS'],
        'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
        'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
        'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}
OUTPUT:
Names Count Info
0 Info 5 [158, 202] ['CREDIT_CARD','BANK_NUMBER']
1 Info 2 [6] ['NHS']
2 Info 3 [405, 609] ['IP_ADDRESS','DOMAIN_NAME']
3 Info 4 [20, 544] ['NRP','DATE_TIME']
4 Info 1 [202, 346, 90, 130] ['PERSON','LOCATION','EMAIL_ADDRESS','PASSPORT']
Upvotes: 1
Views: 321
Reputation: 862581
First, don't name a variable dict, because that shadows the Python built-in dict type.
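To illustrate why the name matters, here is a minimal sketch. The helper call_shadowed_dict is hypothetical and only exists to keep the rebinding local; once the name dict points at a mapping, calling dict(...) no longer reaches the built-in constructor.

```python
def call_shadowed_dict():
    # Hypothetical helper: rebinds the name `dict` locally, as the question does.
    dict = {'Info 2': ['NHS']}
    try:
        # This now tries to call the mapping object, not the built-in type.
        dict(a=1)
    except TypeError as exc:
        return str(exc)
    return None


message = call_shadowed_dict()
print(message)
```

The call fails with a TypeError saying the object is not callable, which is the kind of confusing error that shows up later in a script when a built-in name has been reused.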
Then flatten the dictionary's lists into a new dictionary with keys and values swapped, use Series.map on the PII column, and pass the result to DataFrame.groupby, aggregating with list:
d = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
                'DRIVER_LICENSE'],
     'Info 2': ['NHS'],
     'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
     'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}
# Invert the mapping so each PII value points to its cluster name.
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
# Group rows by the mapped cluster name, keeping first-appearance order.
df1 = (df.groupby(df['PII'].map(d1).rename('Names'), sort=False)
         .agg(list)
         .reset_index())
print(df1)
Names PII Counts
0 Info 5 [CREDIT_CARD, BANK_NUMBER] [158, 202]
1 Info 4 [DATE_TIME, NRP] [544, 20]
2 Info 3 [DOMAIN_NAME, IP_ADDRESS] [609, 405]
3 Info 1 [EMAIL_ADDRESS, LOCATION, PERSON, PASSPORT] [90, 346, 202, 130]
4 Info 2 [NHS] [6]
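For anyone who wants to run this end to end, here is a self-contained sketch. It assumes the input frame has the two columns PII and Counts shown in the question; the final rename and column reorder are my addition to match the column names the question asks for (Names, Count, Info), not part of the original snippet.

```python
import pandas as pd

# Input rebuilt from the question's table (column names assumed to be PII and Counts).
df = pd.DataFrame({
    'PII': ['CREDIT_CARD', 'DATE_TIME', 'DOMAIN_NAME', 'EMAIL_ADDRESS', 'IP_ADDRESS',
            'LOCATION', 'PERSON', 'BANK_NUMBER', 'PASSPORT', 'NHS', 'NRP'],
    'Counts': [158, 544, 609, 90, 405, 346, 202, 202, 130, 6, 20],
})

d = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
                'DRIVER_LICENSE'],
     'Info 2': ['NHS'],
     'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
     'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}

# Invert the mapping: PII value -> cluster name.
d1 = {pii: name for name, piis in d.items() for pii in piis}

out = (df.groupby(df['PII'].map(d1).rename('Names'), sort=False)
         .agg(list)
         .reset_index()
         # Assumption: rename and reorder to mirror the question's desired header.
         .rename(columns={'Counts': 'Count', 'PII': 'Info'})
         [['Names', 'Count', 'Info']])
print(out)
```

Note that any PII value missing from d1 maps to NaN, and groupby drops NaN keys by default, so such rows would silently disappear from the result. Dictionary entries with no matching row (PHONE_NUMBER, SSN and so on) are simply ignored.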
Upvotes: 2