I'm new to Pandas and have a question that I can't answer on my own. For context, this data is output from a firewall. It generates millions of packet records, and I am trying to aggregate that data into a firewall ruleset. The best approach I've come up with is to identify traffic based on the destination IP.
The source/destination ports change when they are ephemeral, so it's important that I aggregate them into the same row. That way I can determine the range of ports for the ruleset.
RAW CSV:
dvc,"src_interface",transport,"src_ip","src_port","dest_ip","dest_port",direction,action,cause,count "Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1025,outbound,allowed,"",2 "Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1026,outbound,allowed,"",2 "Firewall-1",outside,tcp,"4.4.4.4",22,"1.1.1.1",1028,outbound,allowed,"",2 "Firewall-1",outside,tcp,"3.3.3.3",22,"2.2.2.2",2200,outbound,allowed,"",2
Dataframe:
          dvc src_interface transport   src_ip  src_port  dest_ip  dest_port direction   action  cause  count
0  Firewall-1       outside       tcp  4.4.4.4        53  1.1.1.1       1025  outbound  allowed    NaN      2
1  Firewall-1       outside       tcp  4.4.4.4        53  1.1.1.1       1026  outbound  allowed    NaN      2
2  Firewall-1       outside       tcp  4.4.4.4        22  1.1.1.1       1028  outbound  allowed    NaN      2
3  Firewall-1       outside       tcp  3.3.3.3        22  2.2.2.2       2200  outbound  allowed    NaN      2
How would I go about merging rows with the same dest_ip?
CODE:
import glob
import pandas as pd

# Read every CSV in the current directory into a single dataframe
df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index=True)

# Group on every column except dest_ip and collect the dest_ip values into lists
index_cols = df.columns.tolist()
index_cols.remove('dest_ip')
df = df.groupby(index_cols, as_index=False)['dest_ip'].apply(list)
print(df)
Expected Output:
Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1025-1026,1028 outbound allowed nan 2
Firewall-1 outside tcp 3.3.3.3 22 2.2.2.2 2200 outbound allowed nan 2
Most examples I've found online involve joining two dataframes, whereas I only have the one. Any help would be appreciated. Thanks in advance!
Try this: group by all the columns where you expect the information to be duplicated, then aggregate the differing "dest_port" values into a list:
import pandas as pd

df = pd.DataFrame(
    [
        ["Firewall-1", "outside", "tcp", "4.4.4.4", 53, "1.1.1.1", 1025, "outbound", "allowed", "", 2],
        ["Firewall-1", "outside", "tcp", "4.4.4.4", 53, "1.1.1.1", 1026, "outbound", "allowed", "", 2],
        ["Firewall-1", "outside", "tcp", "4.4.4.4", 22, "1.1.1.1", 1028, "outbound", "allowed", "", 2],
        ["Firewall-1", "outside", "tcp", "3.3.3.3", 22, "2.2.2.2", 2200, "outbound", "allowed", "", 2],
    ],
    columns=["dvc", "src_interface", "transport", "src_ip", "src_port", "dest_ip",
             "dest_port", "direction", "action", "cause", "count"],
)

# Group on every other column, then collect the dest_port values into a list per group
index_cols = df.columns.tolist()
index_cols.remove("dest_port")
df = df.groupby(index_cols)["dest_port"].apply(list)
df = df.reset_index()
This results in 3 remaining rows rather than the 2 in your desired output, because the rows for ports 1025/1026 and 1028 differ in src_port (53 vs. 22):
          dvc src_interface transport   src_ip  src_port  dest_ip direction   action cause  count     dest_port
0  Firewall-1       outside       tcp  3.3.3.3        22  2.2.2.2  outbound  allowed            2        [2200]
1  Firewall-1       outside       tcp  4.4.4.4        22  1.1.1.1  outbound  allowed            2        [1028]
2  Firewall-1       outside       tcp  4.4.4.4        53  1.1.1.1  outbound  allowed            2  [1025, 1026]
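If you also want each list collapsed into range strings like 1025-1026,1028 from your expected output, here is a minimal sketch of one way to do it (the ports_to_ranges helper is a hypothetical name of my own, not a pandas function; it assumes the lists contain integers):

def ports_to_ranges(ports):
    # Hypothetical helper: collapse e.g. [1025, 1026, 1028] into '1025-1026,1028'
    ports = sorted(set(ports))
    runs, start, prev = [], ports[0], ports[0]
    for p in ports[1:]:
        if p == prev + 1:
            prev = p  # extend the current contiguous run
            continue
        runs.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = p  # begin a new run
    runs.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(runs)

df["dest_port"] = df["dest_port"].apply(ports_to_ranges)

Applied to the grouped frame above, [1025, 1026] becomes 1025-1026 and [2200] stays 2200.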
I think the following might do what you're looking for:
import pandas as pd

# Create a practice dataframe; rows with a duplicate value in 'key' will be removed
df = pd.DataFrame({'key': [1, 1, 3, 4], 'color': [1, 2, 3, 2], 'house': [1, 2, 3, 7]})
print(df.drop_duplicates(['key']))
Original dataframe:
   key  color  house
0    1      1      1
1    1      2      2
2    3      3      3
3    4      2      7
Output dataframe:
   key  color  house
0    1      1      1
2    3      3      3
3    4      2      7
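Note that drop_duplicates discards the later rows entirely. If you want to keep those values instead (as the question asks), a minimal sketch on the same practice frame, reusing the group-into-lists idea from the other answer:

# Keep one row per key, collecting the other columns into lists instead of dropping them
print(df.groupby('key', as_index=False).agg(list))

which should give something like:

   key   color   house
0    1  [1, 2]  [1, 2]
1    3     [3]     [3]
2    4     [2]     [7]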