Reputation: 509
I am trying to group the rows of the following pandas DataFrame into named clusters and sum the counts in each one. For example, the cluster "Personal Info" consists of (PERSON, LOCATION, PHONE_NUMBER, EMAIL_ADDRESS, PASSPORT, SSN, DRIVER_LICENSE), and its total is the sum of their Counts, which is 460.
Clusters:
for reference I am providing clusters structure
Input data:
Names Counts
CREDIT_CARD 10
CRYPTO 20
DATE_TIME 28
DOMAIN_NAME 40
EMAIL_ADDRESS 45
IBAN_CODE 20
IP_ADDRESS 100
NRP 38
LOCATION 36
PERSON 90
PHONE_NUMBER 105
BANK_NUMBER 29
DRIVER_LICENSE 45
ITIN 38
PASSPORT 49
SSN 90
NHS 0
Output:
Cluster names Total count
Personal Info (90+36+105+45+49+90+45) = 460
Finance (10+29+38+20) = 97
Network (100+40) = 140
Others (20+28) = 48
Info (0) = 0
Upvotes: 0
Views: 857
Reputation: 5745
Here is a full example of how to do that:
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''PII Counts
CREDIT_CARD 10
CRYPTO 20
DATE_TIME 28
DOMAIN_NAME 40
EMAIL_ADDRESS 45
IBAN_CODE 20
IP_ADDRESS 100
NRP 38
LOCATION 36
PERSON 90
PHONE_NUMBER 105
BANK_NUMBER 29
DRIVER_LICENSE 45
ITIN 38
PASSPORT 49
SSN 90
NHS 0'''),sep=r'\s+')
# cluster name -> names that belong to it
mapper = dict(
    personal_info=['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN', 'DRIVER_LICENSE'],
    finance=['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'IBAN_CODE'],
    info=['NHS'],
    network=['IP_ADDRESS', 'DOMAIN_NAME'],
    others=['CRYPTO', 'DATE_TIME', 'NRP'],
)
# invert it: name -> cluster name
mapper = {val: k for k, vals in mapper.items() for val in vals}
# replace each name with its cluster name, then sum the counts per cluster
df['PII'] = df['PII'].map(mapper)
df.groupby(['PII'], as_index=False).sum()
Output:
PII Counts
0 finance 97
1 info 0
2 network 140
3 others 86
4 personal_info 460
EXPLANATION:
First, turn the cluster structure you showed us into a single dictionary whose keys are the current names and whose values are the cluster names.
Second, map each name to its cluster name with pd.Series.map.
Last, group by cluster name and sum the counts with pd.DataFrame.groupby.
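One thing worth checking with this approach: a name that is missing from the dictionary maps to NaN, and groupby drops NaN keys by default, so its count quietly disappears from the totals. Below is a minimal sketch of how that could be detected and handled. The cut-down frame and the catch-all label 'unassigned' are assumptions made for illustration, not part of the question.

```python
import pandas as pd

# Cut-down sample; NRP is deliberately left out of the mapping below.
df = pd.DataFrame({'PII': ['PERSON', 'SSN', 'IP_ADDRESS', 'NRP'],
                   'Counts': [90, 90, 100, 38]})

mapper = {'PERSON': 'personal_info',
          'SSN': 'personal_info',
          'IP_ADDRESS': 'network'}

# Names without an entry in the mapping become NaN here.
clusters = df['PII'].map(mapper)
unmapped = df.loc[clusters.isna(), 'PII'].tolist()

# Send unmapped names to a catch-all label (hypothetical name) so they are still counted.
totals = df['Counts'].groupby(clusters.fillna('unassigned')).sum()

print(unmapped)
print(totals)
```

Comparing totals.sum() with df['Counts'].sum() is a quick way to confirm that no rows were lost.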
Upvotes: 1
Reputation: 14103
You can create a dict and use a list comprehension:
# convert to a dict
d = {'personal_info': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN', 'DRIVER_LICENSE'],
     'finance': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'IBAN_CODE'],
     'info': ['NHS'],
     'network': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'others': ['CRYPTO', 'DATE_TIME', 'NRP']}
# list comprehension
l = [df[df['PII'].isin(d[x])]['Counts'].sum() for x in d]
# create a new frame using zip
new_df = pd.DataFrame(zip(d.keys(), l), columns=['Cluster', 'Count'])
Output:
Cluster Count
0 personal_info 460
1 finance 97
2 info 0
3 network 140
4 others 86
Upvotes: 1
Reputation: 150745
You can create an inverse dictionary and map:
d = {'personal_info': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN', 'DRIVER_LICENSE'],
     'finance': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'IBAN_CODE'],
     'info': ['NHS'],
     'network': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'others': ['CRYPTO', 'DATE_TIME', 'NRP']
     }
d_inv = {x: k for k, v in d.items() for x in v}
(df['Counts'].groupby(df['PII'].map(d_inv)).sum()
    .rename_axis('Cluster names')      # rename to match output
    .reset_index(name='Total count')
)
Output:
Cluster names Total count
0 finance 97
1 info 0
2 network 140
3 others 86
4 personal_info 460
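Note that groupby sorts the cluster names alphabetically. If the rows should follow the order in which the clusters are listed in the dictionary, one option is to reindex before resetting the index. This is a minimal sketch on a cut-down frame (the sample rows are an assumption, chosen only to keep it short):

```python
import pandas as pd

# Cut-down sample of the question's data.
df = pd.DataFrame({'PII': ['CREDIT_CARD', 'IP_ADDRESS', 'PERSON', 'SSN', 'NHS'],
                   'Counts': [10, 100, 90, 90, 0]})

d = {'personal_info': ['PERSON', 'SSN'],
     'finance': ['CREDIT_CARD'],
     'network': ['IP_ADDRESS'],
     'info': ['NHS']}
d_inv = {x: k for k, v in d.items() for x in v}

result = (df['Counts'].groupby(df['PII'].map(d_inv)).sum()
          .reindex(list(d), fill_value=0)   # dictionary order instead of alphabetical
          .rename_axis('Cluster names')
          .reset_index(name='Total count'))

print(result)
```

With fill_value=0, a cluster that has no matching rows in the frame still appears in the result with a total of 0.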
Upvotes: 2