Jroosterman

Reputation: 418

Numpy - How do you normalize specific features in a dataset?

I have a data set that mixes continuous and symbolic features, such as the following:

data = [[duration, protocol, bytes, rate],
        [0,        tcp,      215,   0.45],
        [4,        udp,      1474,  0.63],
        [63,       icmp,     30,    0.07]]

The 1st, 3rd, and 4th columns are continuous features while the 2nd column is symbolic.

Is there a way to normalize the 1st, 3rd, and 4th columns without touching the 2nd, and without having to remove the second from the set of data?

Edit: For this problem, I want to normalize the data by scaling each column to the range 0 to 1, based on that column's min and max.
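
For reference, the scaling described in the edit maps each value x in a column to (x - min) / (max - min). Here is a minimal sketch of that formula applied to the three continuous columns only. It assumes they have been copied into a separate float array; the names `numeric` and `scaled` are only illustrative.

```python
import numpy as np

# Continuous columns only (duration, bytes, rate), taken from the example rows.
numeric = np.array([[0.0, 215.0, 0.45],
                    [4.0, 1474.0, 0.63],
                    [63.0, 30.0, 0.07]])

# Per-column minimum and maximum (axis=0 reduces over rows).
col_min = numeric.min(axis=0)
col_max = numeric.max(axis=0)

# Min-max scaling: each column now spans 0 to 1.
scaled = (numeric - col_min) / (col_max - col_min)
```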

Upvotes: 0

Views: 65

Answers (1)

Arne

Reputation: 10545

You could write a function to normalize a particular column in the way you want and then call it on the columns you want. For example:

import numpy as np

data = np.array([['duration', 'protocol', 'bytes', 'rate'],
                [0,           'tcp',      215,     0.45],
                [4,           'udp',      1474,    0.63],
                [63,          'icmp',     30,      0.07]])

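def normalize_column(col):
    # Skip the header row; the array holds strings, so convert to float first.
    values = data[1:, col].astype(float)
    minimum = values.min()
    maximum = values.max()
    r = maximum - minimum
    # Min-max scale to [0, 1]; the results are written back as strings.
    data[1:, col] = (values - minimum) / r

for col in (0, 2, 3):
    normalize_column(col)
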
data
array([['duration', 'protocol', 'bytes', 'rate'],
       ['0.0', 'tcp', '0.128116', '0.678571'],
       ['0.063492', 'udp', '1.0', '1.0'],
       ['1.0', 'icmp', '0.0', '0.0']], dtype='<U8')
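
Note that because the array has a string dtype, the normalized values are stored as text and cut to the array's string width (the `<U8` above). If you want to keep them as real floats, one option is an object array. The following is only a sketch under that assumption: `normalize_columns` is a hypothetical helper, the header row is left out, and mapping a constant column to 0 is an arbitrary choice to avoid dividing by zero.

```python
import numpy as np

def normalize_columns(rows, cols):
    """Return a copy of rows with the given column indices min-max scaled."""
    # dtype=object keeps each cell as its original Python type,
    # so numbers stay numbers and 'tcp' stays a string.
    out = np.array(rows, dtype=object)
    for col in cols:
        values = out[:, col].astype(float)
        low = values.min()
        span = values.max() - low
        if span == 0:
            # Constant column: assumption, map every value to 0.
            out[:, col] = 0.0
        else:
            out[:, col] = (values - low) / span
    return out

rows = [[0, 'tcp', 215, 0.45],
        [4, 'udp', 1474, 0.63],
        [63, 'icmp', 30, 0.07]]

result = normalize_columns(rows, cols=(0, 2, 3))
```

Here `result[:, 1]` still holds the protocol strings, while the other columns hold floats at full precision.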

Upvotes: 1
