Reputation: 453
I have a dataframe that looks like the following:
source_ip destination_ip malware_tag ransomware_tag brutefore_tag source_bytes destination_bytes label
ip_1 ip_2 True True False 10 20 0
ip_1 ip_2 True False False 20 60 0
ip_1 ip_2 True False False 30 100 0
ip_1 ip_2 True True False 40 300 0
ip_3 ip_4 False False True 5 20 1
ip_3 ip_4 False False True 500 9 1
ip_3 ip_4 False True True 200 15 1
I want first, for each 'label' to create a new column called 'attack type' which is created based on the three boolean tag column; by considering only the most common/frequent attack flag. So I want the resulting dataframe to look as the following:
source_ip destination_ip source_bytes destination_bytes label attack_type
ip_1 ip_2 10 20 0 malware
ip_1 ip_2 20 60 0 malware
ip_1 ip_2 30 100 0 malware
ip_1 ip_2 40 300 0 malware
ip_3 ip_4 5 20 1 bruteforre
ip_3 ip_4 500 9 1 bruteforce
ip_3 ip_4 200 15 1 bruteforce
Second summarize (sum of source and destination bytes) the resulting dataframe by source and destination IPs. So the final dataframe should look like:
source_ip destination_ip total_source_bytes total_destination_bytes attack_type
ip_1 ip_2 100 480 malware
ip_3 ip_4 705 44 bruteforce
I would appreciate any help to achieve this. Thanks.
Upvotes: 0
Views: 40
Reputation: 150735
Let's try idxmax
to extract the attack type for each label, then map
to convert the label to the most common attack type:
atk_by_labels = df.filter(like='tag').groupby(df['label']).sum().idxmax(1)
# first summary
df['attack_type'] = df['label'].map(atk_by_labels)
# second summary
(df.groupby(['source_ip', 'destination_ip', 'attack_type'], as_index=False)
[['total_source_bytes','total_destination_bytes']].sum()
)
Upvotes: 3