user5594750

Reputation: 3

scikit-learn multi dimensional features

i have a question concerning scikit-learn.

Is it possible to merge a multi-dimensional feature list into one feature vector? For example: I have results from an application analysis, and I would like to represent each application with a single feature vector. In the case of network traffic, an analysis result looks like the following:

    traffic = [
        {
            "http_body": "http_body_data", "length": 1024
        },
        {
            "http_body2": "http_body_data2", "length": 2048
        },
        # ... and many more
    ]

So each dict in traffic list describes one network activity of a specific application.

I would like to generate a feature vector that contains all this information for one application, so that I can build a model from the analysis results of a variety of applications.

How can I do this with scikit-learn?

Thank you in advance!

Upvotes: 0

Views: 1026

Answers (2)

Ash

Reputation: 3550

You can't have non-numeric features: the value of a feature should be a number.

So what do you do if you have ip:127.0.0.1, ip:192.168.0.1, and ip:220.220.220.220? You create three features, ip_127.0.0.1, ip_192.168.0.1, and ip_220.220.220.220, and when the value of ip is 127.0.0.1 you set the first one to 1 and the other two to 0.

If ip can take more than, say, 10 values, you just create features for the 10 most common ones, plus an ip_other feature that you set for all samples that have any other IP address.
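This one-hot expansion can be sketched with scikit-learn's DictVectorizer, which turns each distinct string value into its own 0/1 indicator column (the sample IPs below are just placeholders):

```python
from sklearn.feature_extraction import DictVectorizer

# one dict per sample; string values get expanded into indicator columns
samples = [
    {"ip": "127.0.0.1"},
    {"ip": "192.168.0.1"},
    {"ip": "220.220.220.220"},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)

# X is a 3x3 matrix; each row has a single 1 in the column for its ip,
# with feature names like "ip=127.0.0.1"
```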

Upvotes: 0

deformas deformas

Reputation: 26

If every application sends responses of the same length (e.g. the first has length 1024, the second 2048, etc.), you can just join all the results into one vector. This works, for example, if the responses in traffic are serialized lists (such as JSON):

import json

def merge_feature_vector(traffic):
    # concatenate the decoded JSON lists from each request into one flat vector
    result = []
    for id, data in enumerate(traffic):
        result.extend(json.loads(data['http_body%s' % id]))
    return result

Another approach is to use FeatureHasher from sklearn.feature_extraction. For example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

def encode_traffic(traffic):
    # flatten one application's traffic into a single dict of features
    result = {}
    for id, data in enumerate(traffic):
        result['http_body%s' % id] = data['http_body%s' % id]
    return result
...
features = [encode_traffic(traffic) for traffic in train]

h = FeatureHasher(n_features=10)
features = h.fit_transform(features)
features = features.toarray()  # dense array for the classifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(features, labels)  # labels: one class per application

By the way, it mostly depends on what you have in http_body.

Upvotes: 1
