Reputation: 3
I have a question concerning scikit-learn.
Is it possible to merge a multi-dimensional feature list into one feature vector? For example: I have results from an application analysis and I would like to represent each application with a single feature vector. In the case of network traffic, an analysis result looks like the following:
traffic = [
{
"http_body": "http_body_data", "length": 1024
},
{
"http_body2": "http_body_data2", "length": 2048
},
... and many more
]
So each dict in the traffic list describes one network activity of a specific application.
I would like to generate a feature vector that contains all of this information for one application, so that I can build a model from the analysis results of a variety of applications.
How can I do this with scikit-learn?
Thank you in advance!
Upvotes: 0
Views: 1026
Reputation: 3550
You can't have non-numeric features: the value of a feature should be a number.
So what do you do if you have ip:127.0.0.1, ip:192.168.0.1, and ip:220.220.220.220? You create three features (ip_127.0.0.1, ip_192.168.0.1 and ip_220.220.220.220) and, for a sample whose ip is 127.0.0.1, set the first one to 1 and the other two to 0. This is known as one-hot encoding.
If ip can take more than, say, 10 values, you just create 10 features for the most common ones plus an ip_other feature, and set ip_other for all samples that have any other IP address.
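The scheme above can be sketched as follows. This is a minimal illustration, not a library function; the sample data and the `one_hot_ips` helper are made up for the example:

```python
from collections import Counter

def one_hot_ips(samples, top_k=10):
    # Keep the top_k most common IPs as their own features;
    # everything else falls into a single "ip_other" bucket.
    counts = Counter(s["ip"] for s in samples)
    common = [ip for ip, _ in counts.most_common(top_k)]
    encoded = []
    for s in samples:
        ip = s["ip"]
        row = {"ip_%s" % c: (1 if ip == c else 0) for c in common}
        row["ip_other"] = 0 if ip in common else 1
        encoded.append(row)
    return encoded

samples = [{"ip": "127.0.0.1"}, {"ip": "192.168.0.1"}, {"ip": "127.0.0.1"}]
rows = one_hot_ips(samples, top_k=2)
# rows[0] has ip_127.0.0.1 set to 1 and everything else to 0
```

In practice sklearn.feature_extraction.DictVectorizer can do this kind of dict-to-one-hot conversion for you.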
Upvotes: 0
Reputation: 26
If every application sends responses with the same structure (e.g. the first has length 1024, the second 2048, etc.) you can simply join all the results into one vector, for example if the responses in traffic are serialized lists (such as JSON):
import json

def merge_feature_vector(traffic):
    # Assumes each dict stores a JSON-encoded list under a numbered
    # 'http_body<N>' key matching its position in the traffic list.
    result = []
    for i, data in enumerate(traffic):
        result.extend(json.loads(data['http_body%s' % i]))
    return result
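For example, under the same assumption about the key naming ('http_body0', 'http_body1', ...), joining two JSON-encoded lists looks like this (the sample data is made up):

```python
import json

traffic = [
    {"http_body0": json.dumps([1, 2]), "length": 1024},
    {"http_body1": json.dumps([3, 4]), "length": 2048},
]

result = []
for i, data in enumerate(traffic):
    result.extend(json.loads(data["http_body%s" % i]))
# result == [1, 2, 3, 4]
```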
Another approach is to use sklearn.feature_extraction.FeatureHasher. For example:
def encode_traffic(traffic):
    # Flatten one application's traffic into a single dict of features.
    result = {}
    for i, data in enumerate(traffic):
        result['http_body%s' % i] = data['http_body%s' % i]
    return result
...
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

features = [encode_traffic(traffic) for traffic in train]
h = FeatureHasher(n_features=10)
features = h.fit_transform(features)  # sparse matrix; no need to call toarray()
clf = RandomForestClassifier(n_estimators=100)
clf.fit(features, labels)  # the estimator method is fit(), not train(); labels = one class per application
By the way, it mostly depends on what you have in http_body.
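Putting the pieces together, here is a self-contained sketch of the FeatureHasher approach with made-up toy data and hypothetical labels (the string values and class labels are purely illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

# Two toy "applications", each a list of per-request dicts.
train = [
    [{"http_body0": "login page"}, {"http_body1": "ok"}],
    [{"http_body0": "exploit payload"}, {"http_body1": "error"}],
]
labels = [0, 1]  # hypothetical classes, e.g. benign vs. malicious

def encode_traffic(traffic):
    # Flatten one application's traffic into a single feature dict.
    return {"http_body%s" % i: d["http_body%s" % i]
            for i, d in enumerate(traffic)}

features = [encode_traffic(t) for t in train]
h = FeatureHasher(n_features=10)      # hashes (key=value, 1) pairs for string values
X = h.fit_transform(features)         # sparse (2, 10) matrix

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, labels)                    # random forests accept sparse input
```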
Upvotes: 1