imanoob_goeasy
imanoob_goeasy

Reputation: 53

How to replace a value in pandas?

Trying to group 23 different labels in second last column of "KDDTest+.csv" into four groups. Please note, I have deleted the last column of the csv prior to doing this.

I have read the .csv file using

df = pd.read_csv('KDDTrain+.csv', header=None, names = col_names)

where

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

If I print out the first 5 rows of the dataframe, this is the output (please note the 'label' column):

using print(df.head(5))

   duration protocol_type  ... dst_host_srv_rerror_rate    label
0         0           tcp  ...                     0.00   normal
1         0           udp  ...                     0.00   normal
2         0           tcp  ...                     0.00  neptune
3         0           tcp  ...                     0.01   normal
4         0           tcp  ...                     0.00   normal

I've tried both these methods for grouping based on what I found online:

Method 1:

df.replace(to_replace = ['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], value = 'probe', inplace = True)
df.replace(to_replace = ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], value = 'r2l', inplace = True)
df.replace(to_replace = ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], value = 'u2r', inplace = True)
df.replace(to_replace = ['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], value = 'dos', inplace = True)

Method 2:

df['label'] = df['label'].replace(['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], 'probe',regex=True)
df['label'] = df['label'].replace(['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], 'r2l',regex=True)
df['label'] = df['label'].replace(['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], 'u2r',regex=True)
df['label'] = df['label'].replace(['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], 'dos',regex=True)

However, this is still the output of printing the first 5 rows of the dataframe:

After replacing, first 5 rows of df: 

   duration protocol_type  ... dst_host_srv_rerror_rate    label
0         0           tcp  ...                     0.00   normal
1         0           udp  ...                     0.00   normal
2         0           tcp  ...                     0.00  neptune
3         0           tcp  ...                     0.01   normal
4         0           tcp  ...                     0.00   normal

I'm expecting the label column in row 2 to read 'dos' instead of 'neptune', but it's not happening.

What am I doing wrong? Any help is appreciated.

Upvotes: 5

Views: 136

Answers (2)

Pawel Kranzberg
Pawel Kranzberg

Reputation: 1308

By using "neptune." as the to_replace value with regex = True, you are telling Pandas to look for "neptune" and any single additional character (e.g., "neptuneX" or "neptune!"). Since that extra character is not present, the whole phrase does not get replaced. Instead you could use just "neptune", or "neptune.?" for 0 or 1 extra characters, or "neptune.*" for 0 or any greater number of extra characters.

Without regex = True, you are telling Pandas to look for literal "neptune." phrase.

Upvotes: 1

Pavithran Ramachandran
Pavithran Ramachandran

Reputation: 993

Maybe you are using "neptune." instead of "neptune"

My tests seem to work with "neptune"

>>> df
     label
0  neptune
>>> df["label"].replace(["neptune."], "normal", regex=True)
0    neptune
Name: label, dtype: object
>>> df["label"].replace(["neptune"], "normal", regex=True)
0    normal
Name: label, dtype: object
>>> df["label"].replace(["neptune"], "normal")
0    normal
Name: label, dtype: object

Upvotes: 1

Related Questions