Reputation: 53
Trying to group 23 different labels in second last column of "KDDTest+.csv" into four groups. Please note, I have deleted the last column of the csv prior to doing this.
I have read the .csv file using
df = pd.read_csv('KDDTrain+.csv', header=None, names = col_names)
where
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
If I print out the first 5 rows of the dataframe, this is the output (please note the 'label' column):
using print(df.head(5))
duration protocol_type ... dst_host_srv_rerror_rate label
0 0 tcp ... 0.00 normal
1 0 udp ... 0.00 normal
2 0 tcp ... 0.00 neptune
3 0 tcp ... 0.01 normal
4 0 tcp ... 0.00 normal
I've tried both these methods for grouping based on what I found online:
Method 1:
df.replace(to_replace = ['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], value = 'probe', inplace = True)
df.replace(to_replace = ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], value = 'r2l', inplace = True)
df.replace(to_replace = ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], value = 'u2r', inplace = True)
df.replace(to_replace = ['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], value = 'dos', inplace = True)
Method 2:
df['label'] = df['label'].replace(['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], 'probe',regex=True)
df['label'] = df['label'].replace(['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], 'r2l',regex=True)
df['label'] = df['label'].replace(['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], 'u2r',regex=True)
df['label'] = df['label'].replace(['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], 'dos',regex=True)
However, this is still the output of printing the first 5 rows of the dataframe:
After replacing, first 5 rows of df:
duration protocol_type ... dst_host_srv_rerror_rate label
0 0 tcp ... 0.00 normal
1 0 udp ... 0.00 normal
2 0 tcp ... 0.00 neptune
3 0 tcp ... 0.01 normal
4 0 tcp ... 0.00 normal
I'm expecting the label column in row 2 to read 'dos' instead of 'neptune', but it's not happening.
What am I doing wrong? Any help is appreciated.
Upvotes: 5
Views: 136
Reputation: 1308
By using "neptune."
as the to_replace
value with regex = True
, you are telling Pandas to look for "neptune"
and any single additional character (e.g., "neptuneX" or "neptune!"). Since that extra character is not present, the whole phrase does not get replaced. Instead you could use just "neptune"
, or "neptune.?"
for 0 or 1 extra characters, or "neptune.*"
for 0 or any greater number of extra characters.
Without regex = True
, you are telling Pandas to look for literal "neptune."
phrase.
Upvotes: 1
Reputation: 993
Maybe you are using "neptune."
instead of "neptune"
My tests seem to work with "neptune"
>>> df
label
0 neptune
>>> df["label"].replace(["neptune."], "normal", regex=True)
0 neptune
Name: label, dtype: object
>>> df["label"].replace(["neptune"], "normal", regex=True)
0 normal
Name: label, dtype: object
>>> df["label"].replace(["neptune"], "normal")
0 normal
Name: label, dtype: object
Upvotes: 1