sheel

Reputation: 509

How to make clusters of Pandas data frame?

I am trying to group the rows of the following pandas data frame into named clusters. E.g. "Personal Info" is a cluster name; it consists of (PERSON, LOCATION, PHONE_NUMBER, EMAIL_ADDRESS, PASSPORT, SSN, DRIVER_LICENSE), and its total is the sum of their Counts, which is 460.

Clusters:
for reference I am providing clusters structure

Input data:

Names              Counts

CREDIT_CARD        10
CRYPTO             20
DATE_TIME          28
DOMAIN_NAME        40
EMAIL_ADDRESS      45
IBAN_CODE          20
IP_ADDRESS         100
NRP                38
LOCATION           36
PERSON             90
PHONE_NUMBER       105
BANK_NUMBER        29
DRIVER_LICENSE     45
ITIN               38
PASSPORT           49
SSN                90
NHS                0

Output:

Cluster names         Total count

Personal Info        (90+36+105+45+49+90+45) = 460
Finance              (10+29+38+20) = 97  
Network              (100+40) = 140
Others               (20+28) = 48
Info                 (0) = 0

Upvotes: 0

Views: 857

Answers (3)

adir abargil

Reputation: 5745

Here is a full example of how you could do that:

import io

import pandas as pd

df = pd.read_csv(io.StringIO('''PII               Counts

CREDIT_CARD        10
CRYPTO             20
DATE_TIME          28
DOMAIN_NAME        40
EMAIL_ADDRESS      45
IBAN_CODE          20
IP_ADDRESS         100
NRP                38
LOCATION           36
PERSON             90
PHONE_NUMBER       105
BANK_NUMBER        29
DRIVER_LICENSE     45
ITIN               38
PASSPORT           49
SSN                90
NHS                0'''),sep=r'\s+')

mapper = dict(personal_info= ['PERSON','LOCATION','PHONE_NUMBER','EMAIL_ADDRESS','PASSPORT','SSN','DRIVER_LICENSE'],
    finance=['CREDIT_CARD','BANK_NUMBER','ITIN','IBAN_CODE'],
    info= ['NHS'],
    network=['IP_ADDRESS','DOMAIN_NAME'],
    others=['CRYPTO','DATE_TIME','NRP'])
# invert the dict: each PII name now maps to its group name
mapper = {val: k for k, vals in mapper.items() for val in vals}
df['PII'] = df['PII'].map(mapper)
df.groupby(['PII'], as_index=False).sum()

Output:

    PII             Counts
0   finance         97
1   info            0
2   network         140
3   others          86
4   personal_info   460

EXPLANATION:

First, you turn the groups you showed us into one dictionary whose keys are the current names and whose values are the group names.

Second, you map the names to their group names with pd.Series.map.

Last, you group by group name and sum the counts with pd.DataFrame.groupby.
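
The three steps can also be looked at one at a time. This is only a minimal sketch: it uses four rows taken from the question's data, and the names groups, name_to_group, labels and totals are made up for illustration.

```python
import pandas as pd

# Four rows taken from the question's data, to keep the example short.
df = pd.DataFrame({
    "PII": ["PERSON", "LOCATION", "IP_ADDRESS", "DOMAIN_NAME"],
    "Counts": [90, 36, 100, 40],
})

# Group name -> list of member names (illustrative subset of the clusters).
groups = {
    "personal_info": ["PERSON", "LOCATION"],
    "network": ["IP_ADDRESS", "DOMAIN_NAME"],
}

# Step 1: invert the dictionary so each name points at its group.
name_to_group = {name: group for group, names in groups.items() for name in names}

# Step 2: translate every name into its group label.
labels = df["PII"].map(name_to_group)

# Step 3: sum the counts per group label.
totals = df.groupby(labels)["Counts"].sum()
print(totals)
```

Printing name_to_group and labels between the steps is an easy way to check that every name landed in the group you expected.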

Upvotes: 1

It_is_Chris

Reputation: 14103

You can create a dict and use a list comprehension:

# convert to a dict
d = {'personal_info': ['PERSON','LOCATION','PHONE_NUMBER','EMAIL_ADDRESS','PASSPORT','SSN','DRIVER_LICENSE'],
    'finance':['CREDIT_CARD','BANK_NUMBER','ITIN','IBAN_CODE'],
    'info': ['NHS'],
    'network':['IP_ADDRESS','DOMAIN_NAME'],
    'others':['CRYPTO','DATE_TIME','NRP']}

# list comprehension
l = [df[df['PII'].isin(d[x])]['Counts'].sum() for x in d]
# create a new frame using zip
new_df = pd.DataFrame(zip(d.keys(), l), columns=['Cluster', 'Count'])

         Cluster  Count
0  personal_info    460
1        finance     97
2           info      0
3        network    140
4         others     86
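
Note that this approach keeps the clusters in dictionary order, while a groupby sorts the labels alphabetically. If you go the groupby route and still want the dictionary order, reindexing the result by the dictionary keys is one option. A small sketch on three rows from the question, where d_inv, totals and ordered are illustrative names:

```python
import pandas as pd

# Three rows taken from the question's data.
df = pd.DataFrame({
    "PII": ["PERSON", "CREDIT_CARD", "NHS"],
    "Counts": [90, 10, 0],
})

# Trimmed-down version of the cluster dict, in the desired output order.
d = {
    "personal_info": ["PERSON"],
    "finance": ["CREDIT_CARD"],
    "info": ["NHS"],
}
d_inv = {name: group for group, names in d.items() for name in names}

# groupby sorts the group labels alphabetically by default.
totals = df["Counts"].groupby(df["PII"].map(d_inv)).sum()

# Reindex by the dict keys to restore the dictionary order.
ordered = totals.reindex(list(d))
print(ordered)
```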

Upvotes: 1

Quang Hoang

Reputation: 150745

You can create an inverse dictionary and map:

d = {'personal_info': ['PERSON','LOCATION','PHONE_NUMBER','EMAIL_ADDRESS','PASSPORT','SSN','DRIVER_LICENSE'],
    'finance':['CREDIT_CARD','BANK_NUMBER','ITIN','IBAN_CODE'],
    'info': ['NHS'],
    'network':['IP_ADDRESS','DOMAIN_NAME'],
    'others':['CRYPTO','DATE_TIME','NRP']
    }

d_inv = {x:k for k, v in d.items() for x in v}

(df['Counts'].groupby(df['PII'].map(d_inv)).sum()
   .rename_axis('Cluster names')       # rename to match output
   .reset_index(name='Total count')
)

Output:

   Cluster names  Total count
0        finance           97
1           info            0
2        network          140
3         others           86
4  personal_info          460
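
One caveat: others comes out as 86 here because NRP is placed in that group, whereas the question's expected output lists 48 (20+28). If a name is missing from the dictionary, map returns NaN for it and groupby drops that row by default. The sketch below shows both behaviours on three rows from the question; the label "unmapped" is just a placeholder chosen for this example.

```python
import pandas as pd

# Three rows taken from the question's data.
df = pd.DataFrame({
    "PII": ["CRYPTO", "DATE_TIME", "NRP"],
    "Counts": [20, 28, 38],
})

# NRP is deliberately left out of the mapping here.
d_inv = {"CRYPTO": "others", "DATE_TIME": "others"}

# Names without an entry in the dict become NaN.
mapped = df["PII"].map(d_inv)

# groupby drops NaN keys by default, so NRP's count is silently ignored.
dropped = df["Counts"].groupby(mapped).sum()

# Filling the gaps with a placeholder label keeps unmapped rows visible.
kept = df["Counts"].groupby(mapped.fillna("unmapped")).sum()

print(dropped)
print(kept)
```

Checking mapped.isna().any() before grouping is a quick way to spot names that were forgotten in the dictionary.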

Upvotes: 2
