Roy

Reputation: 945

Why average loss goes up when training using Vowpal Wabbit

I tried to use VW to train a regression model on a small set of examples (3112 in total). I think I'm doing it correctly, yet the results look weird. I dug around but didn't find anything helpful.

$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040000   0.040000            1         1.0  -0.2000   0.0000       79
0.051155   0.062310            2         2.0   0.2000  -0.0496       79
0.046606   0.042056            4         4.0   0.4100   0.1482       79
0.052160   0.057715            8         8.0   0.0200   0.0021       78
0.064936   0.077711           16        16.0  -0.1800   0.0547       77
0.060507   0.056079           32        32.0   0.0000   0.3164       79
0.136933   0.213358           64        64.0  -0.5900  -0.0850       79
0.151692   0.166452          128       128.0   0.0700   0.0060       79
0.133965   0.116238          256       256.0   0.0900  -0.0446       78
0.179995   0.226024          512       512.0   0.3700  -0.0217       79
0.109296   0.038597         1024      1024.0   0.1200  -0.0728       79
0.579360   1.049425         2048      2048.0  -0.3700  -0.0084       79
0.485389   0.485389         4096      4096.0   1.9600   0.3934       79 h
0.517748   0.550036         8192      8192.0   0.0700   0.0334       79 h

finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506


$ wc model
      41      48     657 model

Questions:

  1. Why is the number of features in the output (readable) model less than the number of actual features? By my count the training data contains 78 features (79 including the bias term, as shown during training). With 24 feature bits there should be far more than enough room to avoid hash collisions.

  2. Why does the average loss actually go up during training, as you can see in the output above?

  3. (Minor) When I tried to increase the number of feature bits to 32, VW output an empty model. Why?
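For what it's worth, here's my understanding of how the reported average loss is computed (an assumption on my part, not taken from the VW docs): it's the running mean of losses measured on each example before the model is updated on it, so a stretch of badly predicted examples pushes the running average back up even while each update still improves the model. A minimal sketch with hypothetical per-example losses:

```python
# Assumption (not from the VW docs): the "average loss" column is the
# running mean of losses computed on each example *before* the model
# is updated on it, so a run of badly predicted examples pushes the
# running average back up even though every update improves the model.
per_example_losses = [0.04, 0.06, 0.02, 0.01, 0.90, 0.85]  # hypothetical

running_avg = []
total = 0.0
for n, loss in enumerate(per_example_losses, 1):
    total += loss
    running_avg.append(total / n)

# the running average dips while losses are small, then climbs again
```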


EDIT:

I tried shuffling the input file and using --holdout_off, as suggested. But the result is still almost the same - the average loss still goes up.

$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040000   0.040000            1         1.0  -0.2000   0.0000       79
0.051155   0.062310            2         2.0   0.2000  -0.0496       79
0.046606   0.042056            4         4.0   0.4100   0.1482       79
0.052160   0.057715            8         8.0   0.0200   0.0021       78
0.071332   0.090504           16        16.0   0.0300   0.1203       79
0.043720   0.016108           32        32.0  -0.2200  -0.1971       78
0.142895   0.242071           64        64.0   0.0100  -0.1531       79
0.158564   0.174232          128       128.0   0.0500  -0.0439       79
0.150691   0.142818          256       256.0   0.3200   0.1466       79
0.197050   0.243408          512       512.0   0.2300  -0.0459       79
0.117398   0.037747         1024      1024.0   0.0400   0.0284       79
0.636949   1.156501         2048      2048.0   1.2500  -0.0152       79
0.363364   0.089779         4096      4096.0   0.1800   0.0071       79
0.477569   0.591774         8192      8192.0  -0.4800   0.0065       79
0.411068   0.344567        16384     16384.0   0.0700   0.0450       77

finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
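
As a sanity check on the summary above: the reported best constant appears to be just the weighted mean label, i.e. the weighted label sum divided by the weighted example sum (this is my inference from the numbers, not something stated in the docs):

```python
# Sanity check: VW's "best constant" looks like the weighted mean label,
# i.e. weighted label sum / weighted example sum (numbers from the run above).
weighted_label_sum = -105.5
weighted_example_sum = 31120.0
best_constant = weighted_label_sum / weighted_example_sum
# matches the reported -0.0033901 (to the printed precision)
```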

The training examples are all distinct from one another, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of inputs is too small compared to the number of features).


EDIT2:

I tried printing the average loss for every pass of examples and saw that it mostly remains constant.

$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.498822   0.498822         3112      3112.0   0.0800   0.0015       79 h
0.476677   0.454595         6224      6224.0  -0.2200  -0.0085       79 h
0.466413   0.445856         9336      9336.0   0.0200  -0.0022       79 h
0.490221   0.561506        12448     12448.0   0.0700  -0.1113       79 h

finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506

Another try, without the --l1, --l2 and -b parameters:

$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.520286   0.520286         3112      3112.0   0.0800  -0.0021       79 h
0.488581   0.456967         6224      6224.0  -0.2200  -0.0137       79 h
0.474247   0.445538         9336      9336.0   0.0200  -0.0299       79 h
0.496580   0.563450        12448     12448.0   0.0700  -0.1727       79 h
0.533413   0.680958        15560     15560.0  -0.1700   0.0322       79 h
0.524531   0.480201        18672     18672.0  -0.9800  -0.0573       79 h

finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713

Does that mean it's normal for the average loss to go up within a single pass, but that as long as multiple passes yield roughly the same loss it's fine?
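
If I read the progress output correctly (again, my assumption): the "average loss" column is cumulative, while "since last" covers only the examples since the previous progress row, so with -P set to the pass size, "since last" is effectively a per-pass average loss. A sketch with made-up numbers:

```python
# Assumption about the progress output: "average loss" is cumulative and
# "since last" covers only the examples since the previous printout, so
# with -P set to the pass size, "since last" is the per-pass average loss.
# Hypothetical rows of (example counter, cumulative average loss):
rows = [(3112, 0.50), (6224, 0.46), (9336, 0.44)]

since_last = []
prev_n, prev_total = 0, 0.0
for n, cum_avg in rows:
    total = cum_avg * n
    since_last.append((total - prev_total) / (n - prev_n))
    prev_n, prev_total = n, total
# since_last is approximately [0.50, 0.42, 0.40]
```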

Upvotes: 2

Views: 1450

Answers (1)

truf

Reputation: 3111

  1. The model file stores only non-zero weights. So most likely the others were zeroed out, especially since you are using --l1.
  2. It may be caused by many reasons. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so that examples labeled -1 are in the first half and examples labeled 1 in the second, your model will show very good convergence on the first half, but you'll see the average loss bump up as it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses: these are holdout losses (marked with an 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
  3. In the master branch, use of -b 32 is currently blocked outright; you can use up to -b 31. In practice -b 24-28 is usually enough even for tens of thousands of features.
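
The effect in point 2 is easy to reproduce with a toy online learner (a sketch, not VW itself): train a constant-only model by SGD on a label-sorted stream and watch the running average loss jump when the labels flip.

```python
# Toy reproduction of point 2 (not VW itself): an online SGD model
# trained on a label-sorted stream.  It converges on the first half,
# then the running average squared loss jumps when the label flips.
labels = [-1.0] * 500 + [1.0] * 500   # sorted: all -1 first, then all +1

w, lr = 0.0, 0.1                      # constant-only predictor
total, running_avg = 0.0, []
for i, y in enumerate(labels, 1):
    total += (w - y) ** 2             # loss on the example *before* updating
    running_avg.append(total / i)
    w -= lr * 2 * (w - y)             # one SGD step on squared loss

# running_avg shrinks over the first half, then bumps up after example 500
```

A well-shuffled stream mixes both labels throughout, so the running average stays flat instead of bumping.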

I'd recommend getting the up-to-date VW version from GitHub.

Upvotes: 3
