summarize dataframe columns by the most common values

Question

I have a dataframe that looks like the following:

source_ip    destination_ip    malware_tag    ransomware_tag    brutefore_tag    source_bytes    destination_bytes    label
ip_1         ip_2              True           True               False            10              20                   0
ip_1         ip_2              True           False              False            20              60                   0
ip_1         ip_2              True           False              False            30              100                   0
ip_1         ip_2              True           True               False            40              300                   0
ip_3         ip_4              False          False              True             5              20                   1
ip_3         ip_4              False          False              True             500              9                   1
ip_3         ip_4              False          True               True             200              15                   1

I want first, for each 'label' to create a new column called 'attack type' which is created based on the three boolean tag column; by considering only the most common/frequent attack flag. So I want the resulting dataframe to look as the following:

source_ip    destination_ip      source_bytes    destination_bytes    label    attack_type
ip_1         ip_2                10              20                   0        malware
ip_1         ip_2                20              60                   0        malware
ip_1         ip_2                30              100                  0        malware
ip_1         ip_2                40              300                  0        malware
ip_3         ip_4                5               20                    1       bruteforre
ip_3         ip_4                500             9                   1         bruteforce
ip_3         ip_4                200             15                  1         bruteforce

Second summarize (sum of source and destination bytes) the resulting dataframe by source and destination IPs. So the final dataframe should look like:

source_ip    destination_ip    total_source_bytes    total_destination_bytes    attack_type
ip_1         ip_2              100                   480                        malware
ip_3         ip_4              705                   44                         bruteforce

I would appreciate any help to achieve this. Thanks.

Quang Hoang · Accepted Answer

Let's try idxmax to extract the attack type for each label, then map to convert the label to the most common attack type:

atk_by_labels = df.filter(like='tag').groupby(df['label']).sum().idxmax(1)

# first summary
df['attack_type'] = df['label'].map(atk_by_labels)

# second summary
(df.groupby(['source_ip', 'destination_ip', 'attack_type'], as_index=False)
   [['total_source_bytes','total_destination_bytes']].sum()
)

summarize dataframe columns by the most common values

Answers (1)

Related Questions