Reputation: 418
I have a data set that mixes continuous and symbolic features, such as the following:
data = [[duration, protocol, bytes, rate],
[0, tcp, 215, 0.45],
[4, udp, 1474, 0.63],
[63, icmp, 30, 0.07]]
The 1st, 3rd, and 4th columns are continuous features while the 2nd column is symbolic.
Is there a way to normalize the 1st, 3rd, and 4th columns without touching the 2nd, and without having to remove the second from the set of data?
Edit: For this problem, I want to normalize the data by scaling each column to the range 0 to 1, based on that column's min and max.
Upvotes: 0
Views: 65
Reputation: 10545
You could write a function to normalize a particular column in the way you want and then call it on the columns you want. For example:
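import numpy as np

data = np.array([['duration', 'protocol', 'bytes', 'rate'],
                 [0, 'tcp', 215, 0.45],
                 [4, 'udp', 1474, 0.63],
                 [63, 'icmp', 30, 0.07]])

def normalize_column(col):
    # Row 0 is the header, so only rows 1 and onward are converted to floats.
    values = data[1:, col].astype(float)
    minimum = np.min(values)
    maximum = np.max(values)
    r = maximum - minimum  # zero for a constant column, which would divide by zero
    # data is a string array, so the results are written back as strings.
    data[1:, col] = (values - minimum) / r

# Normalize only the continuous columns; column 1 (protocol) is left untouched.
for col in (0, 2, 3):
    normalize_column(col)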
data
array([['duration', 'protocol', 'bytes', 'rate'],
['0.0', 'tcp', '0.128116', '0.678571'],
['0.063492', 'udp', '1.0', '1.0'],
['1.0', 'icmp', '0.0', '0.0']], dtype='<U8')
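Note that in the output above the normalized values come back as strings, because a NumPy array holds a single dtype. If you would rather keep them as floats, here is a minimal sketch that works on a plain list of lists instead. The function name `min_max_normalize` and the choice to map a constant column to 0.0 are my own assumptions, not something from the question:

```python
def min_max_normalize(rows, columns):
    """Return a copy of rows with the given column indices scaled to [0, 1].

    rows[0] is treated as the header; all other columns are left untouched.
    """
    header = list(rows[0])
    body = [list(r) for r in rows[1:]]  # copy so the input is not modified
    for col in columns:
        values = [float(r[col]) for r in body]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r, v in zip(body, values):
            # Assumption: a constant column has no range, so map it to 0.0.
            r[col] = (v - lo) / span if span else 0.0
    return [header] + body


data = [['duration', 'protocol', 'bytes', 'rate'],
        [0, 'tcp', 215, 0.45],
        [4, 'udp', 1474, 0.63],
        [63, 'icmp', 30, 0.07]]

normalized = min_max_normalize(data, columns=(0, 2, 3))
```

Here `normalized[1:]` holds floats in columns 0, 2, and 3, while the protocol column still holds the original strings, and `data` itself is unchanged.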
Upvotes: 1