Arnold

Reputation: 4850

Confusing probabilities from scikit-learn randomforest

I have a time series of integer values that I'm trying to predict. I use a sliding window: the model learns to map 99 consecutive values to the next one. The values are between 0 and 128. X is a cube of n sliding windows, each of length 99, with every integer one-hot encoded as a vector of 128 elements, so its shape is (n, 99, 128). The shape of Y is (n, 128). I see it as a multi-class problem, as Y can take precisely one outcome.

This works fine with Keras/Tensorflow, but when I try to use RandomForest from scikit-learn it complains about the input vector being 3D instead of 2D. So I reshaped the input cube X into a 2D matrix of shape (n, 99 * 128). The results weren't great and in order to understand what's happening I requested the probabilities (see code below).

import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf(X_train, Y_train, X_val, Y_val, samples):
    clf = RandomForestClassifier(n_estimators=32, n_jobs=-1)
    clf.fit(X_train, Y_train)
    score = clf.score(X_val, Y_val)
    print('Score of randomforest =', score)

    # compute some samples
    for i in range(samples):
        index = random.randrange(len(X_val))  # upper bound is already exclusive
        xx = X_val[index].reshape(1, -1)
        probs = clf.predict_proba(xx)
        pred = clf.predict(xx)
        y_true = np.argmax(Y_val[index])
        y_hat = np.argmax(pred)
        print(index, '-', y_true, y_hat, xx.shape, len(probs))
        print(probs)
        print(pred)

The output I get from predict_proba is:

[array([[0.841, 0.159]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), 
 array([[1.]]), array([[1., 0.]]), array([[1., 0.]]), array([[1., 0.]]),
 array([[1., 0.]]), array([[1., 0.]]), array([[0.995, 0.005]]), array([[0.999,
 0.001]]), array([[0.994, 0.006]]), array([[1., 0.]]), array([[0.994, 0.006]]),
 array([[0.977, 0.023]]), array([[0.999, 0.001]]), array([[0.939, 0.061]]),
 array([[0.997, 0.003]]), array([[0.969, 0.031]]), array([[0.997, 0.003]]),
 array([[0.984, 0.016]]), array([[0.949, 0.051]]), array([[1., 0.]]),
 array([[0.95, 0.05]]), array([[1., 0.]]), array([[0.918, 0.082]]), 
 array([[0.887, 0.113]]), array([[1.]]), array([[0.88, 0.12]]), array([[1.]]),
 array([[0.884, 0.116]]), array([[0.941, 0.059]]), array([[1.]]), array([[0.941,
 0.059]]), array([[1.]]), array([[0.965, 0.035]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]]),
 array([[1.]]), array([[1.]]), array([[1.]]), array([[1.]])]

The output has a length of 128 all right, but why is it a list of 2D arrays, some containing one element and some two? As far as I understand from the manual, an array of shape (n_samples, n_classes) should be returned, so (1, 128) in my example.

Could someone help me by pointing out what I am doing wrong?

Edit 1

I did experiments along the lines suggested by @Vivek Kumar (thanks Vivek) in his comments. I input sequences of integers (X) and match them with the next integer in sequence (y). This is the code:

def rff(X_train, Y_train, X_val, Y_val, samples, cont=False):
    print('Input data:', X_train.shape, Y_train.shape, X_val.shape, Y_val.shape)
    clf = RandomForestClassifier(n_estimators=64, n_jobs=-1)
    clf.fit(X_train, Y_train)
    score = clf.score(X_val, Y_val)

    y_true = Y_val
    y_prob = clf.predict_proba(X_val)
    y_hat = clf.predict(X_val)
    print('y_true', y_true.shape, y_true)
    print('y_prob', y_prob.shape, y_prob)
    print('y_hat', y_hat.shape, y_hat)
    #sum_prob = np.sum(y_true == y_prob)
    sum_hat = np.sum(y_true == y_hat)
    print('Score of randomforest =', score)
    print('Score y_hat', sum_hat / len(X_val))
    #print('Score y_prob', sum_prob / len(X_val))

    # compute some individual samples
    for i in range(samples):
        index = random.randrange(len(X_val))  # upper bound is already exclusive
        y_true_i = Y_val[index]
        #y_prob_i = y_prob[index]
        y_hat_i = y_hat[index]
        print('{:4d} - {:3d}{:3d}'.format(index, y_true_i, y_hat_i))

And its output is:

Input data: (4272, 99) (4272,) (1257, 99) (1257,)
y_true (1257,) [ 0  0  0 ... 69 70 70]
y_prob (1257, 29) [[0.09375  0.       0.       ... 0.078125 0.078125 0.015625]
 [0.109375 0.       0.       ... 0.046875 0.0625   0.0625  ]
 [0.125    0.       0.       ... 0.015625 0.078125 0.015625]
 ...
 [0.078125 0.       0.       ... 0.       0.       0.      ]
 [0.046875 0.       0.       ... 0.       0.       0.      ]
 [0.078125 0.       0.       ... 0.       0.       0.      ]]
y_hat (1257,) [81 81 79 ... 67 67 65]
Score of randomforest = 0.20047732696897375
Score y_hat 0.20047732696897375
 228 -  76 77
  51 -  76  0
 563 -  81  0
 501 -   0 77
 457 -  79 79
 285 -  76 77
 209 -  81  0
1116 -  79  0
 178 -  72 77
1209 -  67 65

The probabilities array now has a consistent size, but its shape is completely weird: (1257, 29). Where does this 29 come from? Yet there is some improvement to report: the accuracy has greatly improved. It used to be around 0.0015, now it is about 0.20.

Any ideas on what the probabilities array represents?

Edit 2

My mistake was that by going back from 128 one-hot-encoded values to integers I did not take into account that I had just 29 unique values. predict_proba neatly predicts these 29 values because these are the ones it learned.

The only question remaining is which values the probabilities refer to. Suppose the classes to predict are 0 and 101-128; predict_proba returns values for indices 0..28. What is the mapping of probabilities to classes: 0-->0, 1-->101, 2-->102, ... , 28-->128? I couldn't find any hint about this in the manual.

Upvotes: 1

Views: 585

Answers (1)

Vivek Kumar

Reputation: 36619

First, let's talk about your targets y.

  • A 2-d y is considered a label-indicator matrix, which is used for multi-label or multi-output multi-class tasks in scikit-learn. From your data this does not seem to be the case, so I don't think that you will want to one-hot encode the y.

  • The second thing about the targets in your problem is that you will first need to decide if you want a classification or a regression task. You say that you have a "time series of integer values". So the question is: can those integers be compared to one another numerically?
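
To illustrate the first point, here is a minimal sketch of what happens when a one-hot encoded y is passed to RandomForestClassifier. The toy data, sizes and variable names are made up for illustration. scikit-learn treats each column of a 2-d y as a separate output, so predict_proba returns a list with one array per column. A column that was always 0 during training has a single known class and hence a single probability, which would explain the mix of [[1.]] and two-element arrays in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 10))      # toy features (assumption)
labels = rng.choice([0, 3, 4], size=200)    # only 3 of 6 possible values occur
Y_onehot = np.eye(6, dtype=int)[labels]     # shape (200, 6); columns 1, 2, 5 are all zero

clf = RandomForestClassifier(n_estimators=8, random_state=0)
clf.fit(X, Y_onehot)                        # 2-d y -> treated as 6 separate outputs

probs = clf.predict_proba(X[:1])            # a list with one array per output column
widths = [p.shape[1] for p in probs]        # number of classes seen per column
print(len(probs), widths)
```

Columns that contain both 0 and 1 yield two probabilities, while the all-zero columns yield one.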

Example 1: Consider that you have a problem where you want to categorise some data into three countries as "Japan", "Russia", "USA".

  • These strings can be encoded as 1 ("Japan"), 2 ("Russia") and 3 ("USA") so that they can be used in machine learning models. But we cannot compare these encodings as numbers, as in 2 is bigger than 1 or smaller than 3. Here 1, 2, 3 are just numerical representations of categorical data and carry no numerical meaning. In this case, a classification task is appropriate to place the data into these three classes.

  • But in other scenarios, like predicting stock prices or temperatures, the numbers can and should be compared to one another, and hence regression should be used (to predict the real-valued targets).

Example 2: For better understanding you can also think of the correctness (loss function) of your model. Let's assume a model predicts targets from 1 to 10 and that the correct target for a specific sample is 9.

  • In a classification task, only the correct prediction matters. It will not matter if the model predicted the target as 8 or 1.

  • But in a regression model, if a model predicted the output as 8, then you can say that it is better than a model which predicted the output as 1 for the same sample.
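
The difference in Example 2 can be written out as a small sketch. The two helper functions below are hypothetical names chosen for illustration; they stand in for a typical classification loss (0/1 loss) and a typical regression loss (absolute error).

```python
def zero_one_loss(y_true, y_pred):
    """Classification view: 0 if the label is exactly right, 1 otherwise."""
    return int(y_true != y_pred)


def absolute_error(y_true, y_pred):
    """Regression view: how far the prediction is from the target."""
    return abs(y_true - y_pred)


y_true = 9
for y_pred in (8, 1):
    print(y_pred, zero_one_loss(y_true, y_pred), absolute_error(y_true, y_pred))
```

Under the 0/1 loss both predictions are equally wrong, whereas the absolute error ranks the prediction 8 as much closer than 1.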

Hope you are getting my point. So for your problem, even though you have a finite number of integers (128) as targets, you will need to decide if they make sense in classification or regression.

Note: I am going further with classification here, as in your original question.

Now coming to the features X

One-hot encoding is used if either there is no ordering present in the categories or you cannot determine that ordering correctly. The explanation I gave above for numerical comparison between categories holds here as well.

  • Consider another example of three categories: "high", "medium", "low". These have an inherent ordering. If you encode them as 0 (low), 1 (medium) and 2 (high), they can be compared numerically. So you may decide to keep them as 0, 1, 2 or one-hot encode them.

  • As I said in my comment, random forests are pretty robust against such things, and performance should not be affected much if the categories are encoded strategically. For example, performance may come down if you encode 0 (high), 1 (low), 2 (medium) etc.
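
As a small sketch of the two encodings discussed above (the sample values and variable names are made up for illustration), the ordered integer encoding keeps one column per feature, while one-hot encoding expands it to one column per category:

```python
import numpy as np

levels = ["low", "medium", "high"]                 # categories in their natural order
order = {name: i for i, name in enumerate(levels)}

samples = ["low", "high", "medium", "high"]        # toy data (assumption)
ordinal = np.array([order[s] for s in samples])    # ordered integer encoding
one_hot = np.eye(len(levels), dtype=int)[ordinal]  # one row per sample, one column per level

print(ordinal)
print(one_hot)
```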

Now, coming back to your case and my question from point 1: can those integers be compared to one another numerically? If yes, there is no need to one-hot encode the features. If no, do it.
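
On the mapping asked about in Edit 2: as far as the scikit-learn documentation states, the columns of predict_proba follow the order of the fitted classifier's classes_ attribute, which holds the sorted unique labels seen in training. A minimal sketch with made-up data (the label set 0, 101, 102, 128 and all variable names are just for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 129, size=(300, 5))            # toy features (assumption)
y = rng.choice([0, 101, 102, 128], size=300)       # only four distinct labels occur

clf = RandomForestClassifier(n_estimators=8, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:3])                   # shape (3, 4): one column per seen label
col_to_label = dict(enumerate(clf.classes_))       # column index -> original label
labels_from_proba = clf.classes_[proba.argmax(axis=1)]

print(col_to_label)
print(labels_from_proba, clf.predict(X[:3]))
```

So column i of the probabilities belongs to clf.classes_[i], and indexing classes_ with the argmax column recovers the same labels that predict returns.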

Upvotes: 4
