Jan Musil

Reputation: 508

Vowpal Wabbit possible hash collision

I have a model generated in VW, and I also generated two readable models from the same data (the first with the '--readable_model' argument and the second with '--invert_hash'). When I run my model on my test dataset with the --audit option, the weights reported for some features differ slightly from the weights in the readable models mentioned above. For example, here is part of a readable model trained on train.data:

213006:-0.19007
213008:-0.692261
213009:-0.203193

As you can see, the feature with index 213008 has weight -0.692261. But when I run my model on test.data with the -t and --audit options, some weights in the audit output are different:

-3.962444   q^featureXY:213008:1:-0.756017

What can cause this? I have more than 300k features; is it possible that this is caused by a hash collision? And if there is a hash collision, shouldn't Vowpal Wabbit report it? As you can see, the -t option was used while testing, so the model should be 'stable'.

Upvotes: 1

Views: 413

Answers (1)

arielf

Reputation: 5952

vw allows hash collisions (on purpose)

This is referred to in the literature as "the hash trick".

It is not considered an error to have feature hash-collisions when learning from a large number of features, because a small number of collisions rarely has an adverse effect on learning. In many cases, a small collision rate may even help by lowering the generalization error.

What's the advantage of ignoring collisions?

Since there's no need to treat collisions in a special way, an obvious upside of the hash-trick is much faster learning.

Don't collisions make learning worse?

Hash collisions simply create (random) mixtures of the colliding feature subset. As long as the colliding subset is a small portion of the full feature-set (as can be expected when the hash-space is large enough), these collisions act as a random form of regularization and often (though not always) help avoid over-fitting.
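To make the mixing concrete, here is a minimal Python sketch. It is not vw's actual (murmur-style) hash; the hash function and the synthetic feature names are illustrative assumptions. It shows that in an 18-bit weight table, two distinct feature names can land in the same slot and therefore share a single weight:

```python
import hashlib

BITS = 18
SLOTS = 2 ** BITS  # vw's default weight-table size

def slot(feature: str) -> int:
    # Illustrative stand-in for vw's real hash function.
    digest = hashlib.md5(feature.encode()).digest()
    return int.from_bytes(digest[:4], "big") % SLOTS

# Scan synthetic feature names until two of them collide.  By the
# pigeonhole principle a collision is guaranteed within SLOTS + 1
# names; the birthday bound makes one appear far sooner on average.
seen = {}
for i in range(SLOTS + 1):
    name = f"feature_{i}"
    s = slot(name)
    if s in seen:
        print(f"'{seen[s]}' and '{name}' share weight slot {s}")
        break
    seen[s] = name

# Both names now update (and read) the same weight, so the learner
# effectively sees a random mixture of the two features.
```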

What if I have a too small hash-space (too many collisions)?

If you have more than 300k features, that is indeed an extreme case: 300k is larger than the default hash-space size (2^18 = 262144), so the colliding portion is no longer small. In this case, you should enlarge the hash space by passing -b <bits>, where <bits> is higher than the default 18.
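As a back-of-the-envelope check (a sketch assuming a uniform hash, which vw's hash only approximates), the expected fraction of colliding features follows from the standard birthday-problem approximation:

```python
def collision_fraction(n_features: int, bits: int) -> float:
    """Probability that a given feature shares its hash slot with at
    least one of the other n-1 features, assuming a uniform hash
    into 2**bits slots."""
    m = 2 ** bits
    return 1 - (1 - 1 / m) ** (n_features - 1)

# 300k features in vw's default 18-bit (262144-slot) hash space:
print(f"-b 18: {collision_fraction(300_000, 18):.1%} of features collide")
# The same features after raising the hash space to 24 bits:
print(f"-b 24: {collision_fraction(300_000, 24):.1%} of features collide")
```

Under this approximation, roughly two thirds of 300k features share a slot at the default 18 bits, while at -b 24 the colliding portion drops below 2%.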

How can I know if I have too many collisions?

The progressive validation error, which vw prints as it learns, should give you a good hint at the optimal -b value for your data-set.

You may also search for an optimal value automatically, using vw-hypersearch like this:

    # find the best average loss when using between 19 and 26 bits hash-space
    vw-hypersearch 19 26 vw -b % [other-vw-options...] -d data-set

Upvotes: 1
