user5594750

Reputation: 3

scikit-learn multi dimensional features

i have a question concerning scikit-learn.

Is it possible to merge a multi-dimensional feature list into one feature vector? For example: I have results from an application analysis, and I would like to represent each application with a single feature vector. In the case of network traffic, an analysis result looks like the following:

    traffic = [
        {
            "http_body": "http_body_data", "length": 1024
        },
        {
            "http_body2": "http_body_data2", "length": 2048
        },
        # ... and many more
    ]

So each dict in traffic list describes one network activity of a specific application.

I would like to generate a feature vector that contains all this information for one application, so that I can build a model from the analysis results of a variety of applications.

How can I do this with scikit-learn?

Thank you in advance!

Upvotes: 0

Views: 1026

Answers (2)

Ash

Reputation: 3550

You can't have non-numeric features: the value of a feature should be a number.

So what do you do if you have ip:127.0.0.1, ip:192.168.0.1, and ip:220.220.220.220? You create three features, ip_127.0.0.1, ip_192.168.0.1, and ip_220.220.220.220, and when the value of ip is 127.0.0.1 you set the first one to 1 and the other two to 0.

If ip can take more than, say, 10 values, you just create features for the 10 most common ones, plus an ip_other feature that you set for all samples that have any other IP address.
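This one-hot expansion can be sketched with scikit-learn's DictVectorizer, which turns each distinct string value into its own 0/1 indicator column (the sample IPs below are just placeholders):

```python
from sklearn.feature_extraction import DictVectorizer

# one dict per sample; string values get expanded into indicator columns
samples = [
    {"ip": "127.0.0.1"},
    {"ip": "192.168.0.1"},
    {"ip": "220.220.220.220"},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)

# X is a 3x3 matrix; each row has a single 1 in the column for its ip,
# with feature names like "ip=127.0.0.1"
```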

Upvotes: 0

deformas deformas

Reputation: 26

If every application sends responses of the same length (e.g. the first has length 1024, the second 2048, etc.), you can just join all the results into one vector. This works, for example, if the responses in traffic are serialized lists (such as JSON):

import json

def merge_feature_vector(traffic):
    # concatenate the decoded JSON lists from each request into one flat vector
    result = []
    for id, data in enumerate(traffic):
        result.extend(json.loads(data['http_body%s' % id]))
    return result

Another approach is to use FeatureHasher from sklearn.feature_extraction. For example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

def encode_traffic(traffic):
    # flatten one application's traffic into a single dict of features
    result = {}
    for id, data in enumerate(traffic):
        result['http_body%s' % id] = data['http_body%s' % id]
    return result
...
features = [encode_traffic(traffic) for traffic in train]

h = FeatureHasher(n_features=10)
features = h.fit_transform(features)
features = features.toarray()  # dense array for the classifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(features, labels)  # labels: one class per application

By the way, it mostly depends on what you have in http_body.

Upvotes: 1
