Reputation: 7261
I am trying to build a regression model that predicts how many days it takes to complete each of a number of orders.
My dataset looks like:
| ORDER_NUMBER | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 | TOTAL_DAYS_TO_COMPLETE | Feature8 | Feature9 | Feature10 | Feature11 | Feature12 | Feature13 | Feature14 | Feature15 | Feature16 | Feature17 | Feature18 | Feature19 | Feature20 | Feature21 | Feature22 | Feature23 | Feature24 | Feature25 | Feature26 | Feature27 | Feature28 | Feature29 | Feature30 | Feature31 |
|:------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------------------:|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| 102203591 | 12 | 2014 | 10 | 2014 | 1 | 2015 | 760 | 50 | 83 | 5 | 6 | 12 | 18 | 31 | 8 | 0 | 1 | 0 | 1 | 16 | 131.29 | 24.3768 | 158.82 | 1.13 | 6.52 | 10 | 51 | 39 | 27 | 88 | 1084938 |
| 102231010 | 2 | 2015 | 1 | 2015 | 2 | 2015 | 706 | 35 | 34 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 0 | 0 | 1 | 2 | 11.95 | 5.162 | 17.83 | 1.14 | 3.45 | 1 | 4 | 20 | 16 | 25 | 367140 |
| 102251893 | 6 | 2015 | 4 | 2015 | 3 | 2015 | 1143 | 36 | 43 | 1 | 2 | 4 | 5 | 6 | 3 | 1 | 0 | 0 | 1 | 5 | 8.55 | 5.653 | 34.51 | 4.59 | 6.1 | 0 | 1 | 17 | 30 | 12 | 103906 |
| 102287793 | 4 | 2015 | 2 | 2015 | 4 | 2015 | 733 | 45 | 71 | 4 | 1 | 6 | 35 | 727 | 6 | 0 | 3 | 15 | 0 | 19 | 174.69 | 97.448 | 319.98 | 1.49 | 3.28 | 20 | 113 | 71 | 59 | 71 | 1005041 |
| 102288060 | 6 | 2015 | 5 | 2015 | 4 | 2015 | 1092 | 26 | 21 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 4.73 | 4.5363 | 18.85 | 3.11 | 4.16 | 0 | 1 | 16 | 8 | 16 | 69062 |
| 102308069 | 8 | 2015 | 6 | 2015 | 5 | 2015 | 676 | 41 | 34 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 2.98 | 6.1173 | 11.3 | 1.36 | 1.85 | 0 | 1 | 17 | 12 | 3 | 145887 |
| 102319918 | 8 | 2015 | 7 | 2015 | 6 | 2015 | 884 | 25 | 37 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 1 | 0 | 2 | 5.57 | 3.7083 | 9.18 | 0.97 | 2.48 | 0 | 1 | 14 | 5 | 7 | 45243 |
| 102327578 | 6 | 2015 | 4 | 2015 | 6 | 2015 | 595 | 49 | 68 | 3 | 5 | 9 | 11 | 13 | 5 | 4 | 2 | 0 | 1 | 10 | 55.41 | 24.3768 | 104.98 | 2.03 | 4.31 | 10 | 51 | 39 | 26 | 40 | 418266 |
| 102337989 | 7 | 2015 | 5 | 2015 | 7 | 2015 | 799 | 50 | 66 | 5 | 6 | 12 | 21 | 29 | 12 | 0 | 0 | 0 | 1 | 20 | 138.79 | 24.3768 | 172.56 | 1.39 | 7.08 | 10 | 51 | 39 | 34 | 101 | 1229299 |
| 102450069 | 8 | 2015 | 7 | 2015 | 11 | 2015 | 456 | 20 | 120 | 2 | 1 | 3 | 12 | 14 | 8 | 0 | 0 | 0 | 0 | 7 | 2.92 | 6.561 | 12.3 | 1.43 | 1.87 | 2 | 1 | 15 | 6 | 6 | 142805 |
| 102514564 | 5 | 2016 | 3 | 2016 | 2 | 2016 | 639 | 25 | 35 | 1 | 2 | 4 | 3 | 6 | 3 | 0 | 0 | 0 | 0 | 3 | 4.83 | 4.648 | 14.22 | 2.02 | 3.06 | 0 | 1 | 15 | 5 | 13 | 62941 |
| 102528121 | 10 | 2015 | 9 | 2015 | 3 | 2016 | 413 | 15 | 166 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 2 | 4.23 | 1.333 | 15.78 | 8.66 | 11.84 | 1 | 4 | 8 | 6 | 3 | 111752 |
| 102564376 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 802 | 27 | 123 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 1 | 0 | 0 | 3 | 1.27 | 2.063 | 6.9 | 2.73 | 3.34 | 1 | 4 | 14 | 20 | 6 | 132403 |
| 102564472 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 817 | 27 | 123 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.03 | 2.063 | 9.86 | 4.28 | 4.78 | 1 | 4 | 14 | 22 | 4 | 116907 |
| 102599569 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 1 | 2 | 4 | 3 | 4 | 3 | 0 | 0 | 0 | 0 | 2 | 27.73 | 15.8993 | 60.5 | 2.06 | 3.81 | 12 | 108 | 34 | 24 | 20 | 119743 |
| 102599628 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 3 | 4 | 8 | 8 | 9 | 7 | 0 | 0 | 0 | 2 | 8 | 39.28 | 14.8593 | 91.26 | 3.5 | 6.14 | 12 | 108 | 34 | 38 | 15 | 173001 |
| 102606421 | 3 | 2016 | 12 | 2015 | 5 | 2016 | 965 | 55 | 161 | 5 | 11 | 17 | 29 | 44 | 11 | 1 | 1 | 0 | 1 | 22 | 148.06 | 23.7983 | 195.69 | 2 | 8.22 | 10 | 51 | 39 | 47 | 112 | 1196097 |
| 102621293 | 7 | 2016 | 5 | 2016 | 6 | 2016 | 701 | 42 | 27 | 2 | 1 | 4 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 8.39 | 3.7455 | 13.93 | 1.48 | 3.72 | 1 | 5 | 14 | 14 | 20 | 258629 |
| 102632364 | 7 | 2016 | 6 | 2016 | 6 | 2016 | 982 | 41 | 26 | 4 | 2 | 7 | 6 | 6 | 2 | 0 | 0 | 0 | 1 | 4 | 26.07 | 2.818 | 37.12 | 3.92 | 13.17 | 1 | 5 | 14 | 22 | 10 | 167768 |
| 102643207 | 9 | 2016 | 9 | 2016 | 7 | 2016 | 255 | 9 | 73 | 3 | 1 | 5 | 4 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 2.17 | 0.188 | 4.98 | 14.95 | 26.49 | 1 | 4 | 2 | 11 | 1 | 49070 |
| 102656091 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 356 | 21 | 35 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.45 | 2.0398 | 5.54 | 2.01 | 2.72 | 1 | 4 | 14 | 15 | 3 | 117107 |
| 102660407 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 462 | 21 | 31 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 3.18 | 2.063 | 8.76 | 2.7 | 4.25 | 1 | 4 | 14 | 14 | 10 | 151272 |
| 102665666 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.188 | 2.95 | 10.37 | 15.69 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665667 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.72 | 0.188 | 2.22 | 7.98 | 11.81 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665668 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9 | 0.188 | 2.24 | 7.13 | 11.91 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102666306 | 7 | 2016 | 6 | 2016 | 7 | 2016 | 235 | 16 | 34 | 3 | 1 | 5 | 5 | 6 | 4 | 0 | 0 | 0 | 0 | 3 | 14.06 | 3.3235 | 31.27 | 5.18 | 9.41 | 1 | 1 | 16 | 5 | 18 | 246030 |
| 102668177 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 233 | 36 | 32 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2.5 | 5.2043 | 8.46 | 1.15 | 1.63 | 0 | 1 | 14 | 2 | 4 | 89059 |
| 102669909 | 6 | 2016 | 4 | 2016 | 8 | 2016 | 244 | 46 | 105 | 4 | 11 | 16 | 28 | 30 | 15 | 1 | 2 | 1 | 1 | 25 | 95.49 | 26.541 | 146.89 | 1.94 | 5.53 | 1 | 51 | 33 | 9 | 48 | 78488 |
| 102670188 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 413 | 20 | 109 | 1 | 1 | 2 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 2.36 | 6.338 | 8.25 | 0.93 | 1.3 | 2 | 1 | 14 | 5 | 3 | 117137 |
| 102671063 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 296 | 46 | 44 | 2 | 4 | 7 | 7 | 111 | 3 | 1 | 0 | 1 | 0 | 7 | 12.96 | 98.748 | 146.24 | 1.35 | 1.48 | 20 | 113 | 70 | 26 | 9 | 430192 |
| 102672475 | 8 | 2016 | 7 | 2016 | 8 | 2016 | 217 | 20 | 23 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.5 | 4.9093 | 5.37 | 0.99 | 1.09 | 0 | 1 | 16 | 0 | 1 | 116673 |
| 102672477 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 194 | 20 | 36 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.61 | 5.1425 | 3.65 | 0.59 | 0.71 | 0 | 1 | 16 | 0 | 2 | 98750 |
| 102672513 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 228 | 20 | 36 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.25 | 5.1425 | 6.48 | 1.21 | 1.26 | 0 | 1 | 16 | 0 | 2 | 116780 |
| 102682943 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 417 | 20 | 113 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.64 | 6.338 | 5.53 | 0.77 | 0.87 | 2 | 1 | 14 | 5 | 2 | 100307 |
`ORDER_NUMBER` should not be a feature in the model -- it is just a random unique identifier -- but I would like to keep it in the final dataset so I can tie the predictions and actual values back to each order.
Currently, my code looks like this:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return importances.sort_values(by='Gini-importance', ascending=False)

def compare_values(arr1, arr2):
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediffs.append(abs(thing1 - thing2))
    return thediffs

def print_to_file(filepath, arr):
    with open(filepath, 'w') as f:
        for item in arr:
            f.write("%s\n" % item)

# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')

# create the labels, i.e. the field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]

# create the features, i.e. the data used to estimate the labels
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# remove the order number since we don't need it
data = data.drop('ORDER_NUMBER', axis=1)
# remove the header
data = data[1:]

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2)

rf = RandomForestRegressor(
    bootstrap=True,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)

importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
```
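For what it's worth, `compare_values` followed by `np.average` is just the mean absolute error, which scikit-learn provides directly. A minimal check with toy numbers (not the real order data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([7, 155, 84, 64])
y_pred = np.array([34.496, 77.366, 69.6105, 61.6825])

# hand-rolled version: average of absolute differences
manual = np.average([abs(a - b) for a, b in zip(y_true, y_pred)])
# scikit-learn's built-in equivalent
builtin = mean_absolute_error(y_true, y_pred)

assert np.isclose(manual, builtin)
```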
If I `print(y_test)` and `print(rf_predictions)`, I get something like:
**print(y_test)**

```
7
155
84
64
49
41
200
168
43
111
64
46
96
47
50
27
216
..
```
**print(rf_predictions)**

```
34.496
77.366
69.6105
61.6825
80.8495
79.8785
177.5465
129.014
70.0405
97.3975
82.4435
57.9575
108.018
57.5515
..
```
And it works: printing `y_test` and `rf_predictions` gives me the actual labels for the test data and the predicted label values. However, I would like to see which orders are associated with both the `y_test` values and the `rf_predictions` values. How can I keep the order numbers and build a dataframe like the one below?
| Order Number | Predicted Value | Actual Value |
|--------------|-------------------|--------------|
| Foo0 | 34.496 | 7 |
| Foo1 | 77.366 | 155 |
| Foo2 | 69.6105 | 84 |
| Foo3 | 61.6825 | 64 |
I have tried looking at this post but could not get a solution from it. I also tried `print(y_test, rf_predictions)`, but that did no good since I had already `.drop()`ped the `ORDER_NUMBER` field.
Upvotes: 3
Views: 2196
Reputation: 30589
As you're using pandas dataframes, the index is retained in all of your x/y train/test datasets, so you can re-assemble everything after applying the model. You just need to save the order numbers before dropping that column: `order_numbers = data['ORDER_NUMBER']`. The predictions `rf_predictions` are returned in the same order as the input data to `rf.predict(X_test)`, i.e. `rf_predictions[i]` belongs to `X_test.iloc[i]`.
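To see the index retention concretely, here is a small self-contained sketch with a toy frame (hypothetical data, not the order table):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(10), 'y': range(10, 20)})
X_train, X_test, y_train, y_test = train_test_split(
    df[['x']], df['y'], test_size=0.3, random_state=0)

# The split shuffles the rows but keeps the original index labels,
# so X_test.index tells you which source rows landed in the test set.
assert list(X_test.index) == list(y_test.index)
assert set(X_test.index).issubset(set(df.index))
```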
This creates your required result dataset:
```python
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
```
Btw, `data = data[1:]` doesn't remove the header -- it removes the first data row. `read_csv` already consumes the header line as column names, so there's no need to remove anything when you work with pandas dataframes.
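A quick way to convince yourself, using an in-memory CSV with two of the order rows:

```python
import io
import pandas as pd

csv = "ORDER_NUMBER,TOTAL_DAYS_TO_COMPLETE\n102203591,760\n102231010,706\n"
data = pd.read_csv(io.StringIO(csv))

# read_csv has already turned the header line into column names,
# so the frame holds only the two data rows...
assert len(data) == 2
# ...and data[1:] silently drops the first order, not a header:
assert len(data[1:]) == 1
assert data[1:].iloc[0]['ORDER_NUMBER'] == 102231010
```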
So the final program will be:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return importances.sort_values(by='Gini-importance', ascending=False)

def compare_values(arr1, arr2):
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediffs.append(abs(thing1 - thing2))
    return thediffs

def print_to_file(filepath, arr):
    with open(filepath, 'w') as f:
        for item in arr:
            f.write("%s\n" % item)

# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')

# create the labels, i.e. the field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)

# save the order numbers, then drop the column so it isn't used as a feature
order_numbers = data['ORDER_NUMBER']
data = data.drop('ORDER_NUMBER', axis=1)

# split into training and testing sets
# (random_state=1 fixed so the sample output is reproducible)
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=1)

rf = RandomForestRegressor(
    bootstrap=True,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)

importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)

res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
print(res)
```
With your example data from above we get (for `train_test_split` with `random_state=1`):
```
    ORDER_NUMBER  Predicted Value  Actual Value
3      102287793         652.0746           733
14     102599569         650.3984           425
19     102643207         319.4964           255
20     102656091         388.6004           356
26     102668177         475.1724           233
27     102669909         671.9158           244
32     102672513         319.1550           228
```
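As a side note, an equivalent way to assemble the result without the intermediate join is to look up the order numbers by the test index directly. A sketch with toy stand-ins (hypothetical values, not the real variables):

```python
import numpy as np
import pandas as pd

# toy stand-ins for order_numbers, y_test and rf_predictions above
order_numbers = pd.Series([101, 102, 103, 104], name='ORDER_NUMBER')
y_test = pd.Series([7, 155], index=[2, 0], name='Actual Value')
rf_predictions = np.array([34.496, 77.366])

# .loc with y_test.index pulls exactly the matching order numbers,
# since the test set keeps the original index labels
res = pd.DataFrame({
    'ORDER_NUMBER': order_numbers.loc[y_test.index].values,
    'Predicted Value': rf_predictions,
    'Actual Value': y_test.values,
}, index=y_test.index)
```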
Upvotes: 1