How to replace a value in pandas?

Question

Trying to group 23 different labels in second last column of "KDDTest+.csv" into four groups. Please note, I have deleted the last column of the csv prior to doing this.

I have read the .csv file using

df = pd.read_csv('KDDTrain+.csv', header=None, names = col_names)

where

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

If I print out the first 5 rows of the dataframe, this is the output (please note the 'label' column):

using print(df.head(5))

   duration protocol_type  ... dst_host_srv_rerror_rate    label
0         0           tcp  ...                     0.00   normal
1         0           udp  ...                     0.00   normal
2         0           tcp  ...                     0.00  neptune
3         0           tcp  ...                     0.01   normal
4         0           tcp  ...                     0.00   normal

I've tried both these methods for grouping based on what I found online:

Method 1:

df.replace(to_replace = ['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], value = 'probe', inplace = True)
df.replace(to_replace = ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], value = 'r2l', inplace = True)
df.replace(to_replace = ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], value = 'u2r', inplace = True)
df.replace(to_replace = ['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], value = 'dos', inplace = True)

Method 2:

df['label'] = df['label'].replace(['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], 'probe',regex=True)
df['label'] = df['label'].replace(['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], 'r2l',regex=True)
df['label'] = df['label'].replace(['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], 'u2r',regex=True)
df['label'] = df['label'].replace(['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], 'dos',regex=True)

However, this is still the output of printing the first 5 rows of the dataframe:

After replacing, first 5 rows of df: 

   duration protocol_type  ... dst_host_srv_rerror_rate    label
0         0           tcp  ...                     0.00   normal
1         0           udp  ...                     0.00   normal
2         0           tcp  ...                     0.00  neptune
3         0           tcp  ...                     0.01   normal
4         0           tcp  ...                     0.00   normal

I'm expecting the label column in row 2 to read 'dos' instead of 'neptune', but it's not happening.

What am I doing wrong? Any help is appreciated.

Pawel Kranzberg · Accepted Answer

By using "neptune." as the to_replace value with regex = True, you are telling Pandas to look for "neptune" and any single additional character (e.g., "neptuneX" or "neptune!"). Since that extra character is not present, the whole phrase does not get replaced. Instead you could use just "neptune", or "neptune.?" for 0 or 1 extra characters, or "neptune.*" for 0 or any greater number of extra characters.

Without regex = True, you are telling Pandas to look for literal "neptune." phrase.

How to replace a value in pandas?

Answers (2)

Related Questions