Reputation: 7261
I am trying to build a regression model that predicts how many days it takes to complete each of a number of orders.
My dataset looks like:
| ORDER_NUMBER | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 | TOTAL_DAYS_TO_COMPLETE | Feature8 | Feature9 | Feature10 | Feature11 | Feature12 | Feature13 | Feature14 | Feature15 | Feature16 | Feature17 | Feature18 | Feature19 | Feature20 | Feature21 | Feature22 | Feature23 | Feature24 | Feature25 | Feature26 | Feature27 | Feature28 | Feature29 | Feature30 | Feature31 |
|:------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------------------:|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| 102203591 | 12 | 2014 | 10 | 2014 | 1 | 2015 | 760 | 50 | 83 | 5 | 6 | 12 | 18 | 31 | 8 | 0 | 1 | 0 | 1 | 16 | 131.29 | 24.3768 | 158.82 | 1.13 | 6.52 | 10 | 51 | 39 | 27 | 88 | 1084938 |
| 102231010 | 2 | 2015 | 1 | 2015 | 2 | 2015 | 706 | 35 | 34 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 0 | 0 | 1 | 2 | 11.95 | 5.162 | 17.83 | 1.14 | 3.45 | 1 | 4 | 20 | 16 | 25 | 367140 |
| 102251893 | 6 | 2015 | 4 | 2015 | 3 | 2015 | 1143 | 36 | 43 | 1 | 2 | 4 | 5 | 6 | 3 | 1 | 0 | 0 | 1 | 5 | 8.55 | 5.653 | 34.51 | 4.59 | 6.1 | 0 | 1 | 17 | 30 | 12 | 103906 |
| 102287793 | 4 | 2015 | 2 | 2015 | 4 | 2015 | 733 | 45 | 71 | 4 | 1 | 6 | 35 | 727 | 6 | 0 | 3 | 15 | 0 | 19 | 174.69 | 97.448 | 319.98 | 1.49 | 3.28 | 20 | 113 | 71 | 59 | 71 | 1005041 |
| 102288060 | 6 | 2015 | 5 | 2015 | 4 | 2015 | 1092 | 26 | 21 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 4.73 | 4.5363 | 18.85 | 3.11 | 4.16 | 0 | 1 | 16 | 8 | 16 | 69062 |
| 102308069 | 8 | 2015 | 6 | 2015 | 5 | 2015 | 676 | 41 | 34 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 2.98 | 6.1173 | 11.3 | 1.36 | 1.85 | 0 | 1 | 17 | 12 | 3 | 145887 |
| 102319918 | 8 | 2015 | 7 | 2015 | 6 | 2015 | 884 | 25 | 37 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 1 | 0 | 2 | 5.57 | 3.7083 | 9.18 | 0.97 | 2.48 | 0 | 1 | 14 | 5 | 7 | 45243 |
| 102327578 | 6 | 2015 | 4 | 2015 | 6 | 2015 | 595 | 49 | 68 | 3 | 5 | 9 | 11 | 13 | 5 | 4 | 2 | 0 | 1 | 10 | 55.41 | 24.3768 | 104.98 | 2.03 | 4.31 | 10 | 51 | 39 | 26 | 40 | 418266 |
| 102337989 | 7 | 2015 | 5 | 2015 | 7 | 2015 | 799 | 50 | 66 | 5 | 6 | 12 | 21 | 29 | 12 | 0 | 0 | 0 | 1 | 20 | 138.79 | 24.3768 | 172.56 | 1.39 | 7.08 | 10 | 51 | 39 | 34 | 101 | 1229299 |
| 102450069 | 8 | 2015 | 7 | 2015 | 11 | 2015 | 456 | 20 | 120 | 2 | 1 | 3 | 12 | 14 | 8 | 0 | 0 | 0 | 0 | 7 | 2.92 | 6.561 | 12.3 | 1.43 | 1.87 | 2 | 1 | 15 | 6 | 6 | 142805 |
| 102514564 | 5 | 2016 | 3 | 2016 | 2 | 2016 | 639 | 25 | 35 | 1 | 2 | 4 | 3 | 6 | 3 | 0 | 0 | 0 | 0 | 3 | 4.83 | 4.648 | 14.22 | 2.02 | 3.06 | 0 | 1 | 15 | 5 | 13 | 62941 |
| 102528121 | 10 | 2015 | 9 | 2015 | 3 | 2016 | 413 | 15 | 166 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 2 | 4.23 | 1.333 | 15.78 | 8.66 | 11.84 | 1 | 4 | 8 | 6 | 3 | 111752 |
| 102564376 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 802 | 27 | 123 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 1 | 0 | 0 | 3 | 1.27 | 2.063 | 6.9 | 2.73 | 3.34 | 1 | 4 | 14 | 20 | 6 | 132403 |
| 102564472 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 817 | 27 | 123 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.03 | 2.063 | 9.86 | 4.28 | 4.78 | 1 | 4 | 14 | 22 | 4 | 116907 |
| 102599569 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 1 | 2 | 4 | 3 | 4 | 3 | 0 | 0 | 0 | 0 | 2 | 27.73 | 15.8993 | 60.5 | 2.06 | 3.81 | 12 | 108 | 34 | 24 | 20 | 119743 |
| 102599628 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 3 | 4 | 8 | 8 | 9 | 7 | 0 | 0 | 0 | 2 | 8 | 39.28 | 14.8593 | 91.26 | 3.5 | 6.14 | 12 | 108 | 34 | 38 | 15 | 173001 |
| 102606421 | 3 | 2016 | 12 | 2015 | 5 | 2016 | 965 | 55 | 161 | 5 | 11 | 17 | 29 | 44 | 11 | 1 | 1 | 0 | 1 | 22 | 148.06 | 23.7983 | 195.69 | 2 | 8.22 | 10 | 51 | 39 | 47 | 112 | 1196097 |
| 102621293 | 7 | 2016 | 5 | 2016 | 6 | 2016 | 701 | 42 | 27 | 2 | 1 | 4 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 8.39 | 3.7455 | 13.93 | 1.48 | 3.72 | 1 | 5 | 14 | 14 | 20 | 258629 |
| 102632364 | 7 | 2016 | 6 | 2016 | 6 | 2016 | 982 | 41 | 26 | 4 | 2 | 7 | 6 | 6 | 2 | 0 | 0 | 0 | 1 | 4 | 26.07 | 2.818 | 37.12 | 3.92 | 13.17 | 1 | 5 | 14 | 22 | 10 | 167768 |
| 102643207 | 9 | 2016 | 9 | 2016 | 7 | 2016 | 255 | 9 | 73 | 3 | 1 | 5 | 4 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 2.17 | 0.188 | 4.98 | 14.95 | 26.49 | 1 | 4 | 2 | 11 | 1 | 49070 |
| 102656091 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 356 | 21 | 35 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.45 | 2.0398 | 5.54 | 2.01 | 2.72 | 1 | 4 | 14 | 15 | 3 | 117107 |
| 102660407 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 462 | 21 | 31 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 3.18 | 2.063 | 8.76 | 2.7 | 4.25 | 1 | 4 | 14 | 14 | 10 | 151272 |
| 102665666 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.188 | 2.95 | 10.37 | 15.69 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665667 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.72 | 0.188 | 2.22 | 7.98 | 11.81 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665668 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9 | 0.188 | 2.24 | 7.13 | 11.91 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102666306 | 7 | 2016 | 6 | 2016 | 7 | 2016 | 235 | 16 | 34 | 3 | 1 | 5 | 5 | 6 | 4 | 0 | 0 | 0 | 0 | 3 | 14.06 | 3.3235 | 31.27 | 5.18 | 9.41 | 1 | 1 | 16 | 5 | 18 | 246030 |
| 102668177 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 233 | 36 | 32 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2.5 | 5.2043 | 8.46 | 1.15 | 1.63 | 0 | 1 | 14 | 2 | 4 | 89059 |
| 102669909 | 6 | 2016 | 4 | 2016 | 8 | 2016 | 244 | 46 | 105 | 4 | 11 | 16 | 28 | 30 | 15 | 1 | 2 | 1 | 1 | 25 | 95.49 | 26.541 | 146.89 | 1.94 | 5.53 | 1 | 51 | 33 | 9 | 48 | 78488 |
| 102670188 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 413 | 20 | 109 | 1 | 1 | 2 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 2.36 | 6.338 | 8.25 | 0.93 | 1.3 | 2 | 1 | 14 | 5 | 3 | 117137 |
| 102671063 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 296 | 46 | 44 | 2 | 4 | 7 | 7 | 111 | 3 | 1 | 0 | 1 | 0 | 7 | 12.96 | 98.748 | 146.24 | 1.35 | 1.48 | 20 | 113 | 70 | 26 | 9 | 430192 |
| 102672475 | 8 | 2016 | 7 | 2016 | 8 | 2016 | 217 | 20 | 23 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.5 | 4.9093 | 5.37 | 0.99 | 1.09 | 0 | 1 | 16 | 0 | 1 | 116673 |
| 102672477 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 194 | 20 | 36 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.61 | 5.1425 | 3.65 | 0.59 | 0.71 | 0 | 1 | 16 | 0 | 2 | 98750 |
| 102672513 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 228 | 20 | 36 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.25 | 5.1425 | 6.48 | 1.21 | 1.26 | 0 | 1 | 16 | 0 | 2 | 116780 |
| 102682943 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 417 | 20 | 113 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.64 | 6.338 | 5.53 | 0.77 | 0.87 | 2 | 1 | 14 | 5 | 2 | 100307 |
`ORDER_NUMBER` should not be a feature in the model -- it is just a random unique identifier -- but I would like to keep it in the final dataset so I can tie the predictions and actual values back to each order.
Currently, my code looks like this:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return importances.sort_values(by='Gini-importance', ascending=False)

def compare_values(arr1, arr2):
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediffs.append(abs(thing1 - thing2))
    return thediffs

def print_to_file(filepath, arr):
    with open(filepath, 'w') as f:
        for item in arr:
            f.write("%s\n" % item)

# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')

# create the labels, i.e. the field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]

# create the features, i.e. the data used to estimate the labels
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# remove the order number since we don't need it
data = data.drop('ORDER_NUMBER', axis=1)
# remove the header
data = data[1:]

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2)

rf = RandomForestRegressor(
    bootstrap=True,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)

importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
```
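For what it's worth, `compare_values` followed by `np.average` is just the mean absolute error, which scikit-learn provides directly. A minimal check with toy numbers (not the real order data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([7, 155, 84, 64])
y_pred = np.array([34.496, 77.366, 69.6105, 61.6825])

# hand-rolled version: average of absolute differences
manual = np.average([abs(a - b) for a, b in zip(y_true, y_pred)])
# scikit-learn's built-in equivalent
builtin = mean_absolute_error(y_true, y_pred)

assert np.isclose(manual, builtin)
```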
If I `print(y_test)` and `print(rf_predictions)`, I get something like:
**print(y_test)**

```
7
155
84
64
49
41
200
168
43
111
64
46
96
47
50
27
216
..
```
**print(rf_predictions)**

```
34.496
77.366
69.6105
61.6825
80.8495
79.8785
177.5465
129.014
70.0405
97.3975
82.4435
57.9575
108.018
57.5515
..
```
And it works: printing `y_test` and `rf_predictions` gives me the actual labels for the test data and the predicted label values. However, I would like to see which orders are associated with both the `y_test` values and the `rf_predictions` values. How can I keep the order numbers and build a dataframe like the one below?
| Order Number | Predicted Value | Actual Value |
|--------------|-------------------|--------------|
| Foo0 | 34.496 | 7 |
| Foo1 | 77.366 | 155 |
| Foo2 | 69.6105 | 84 |
| Foo3 | 61.6825 | 64 |
I have tried looking at this post but could not get a solution from it. I also tried `print(y_test, rf_predictions)`, but that did no good since I had already `.drop()`ped the `ORDER_NUMBER` field.
Upvotes: 3
Views: 2196
Reputation: 30589
As you're using pandas dataframes, the index is retained in all of your x/y train/test datasets, so you can re-assemble everything after applying the model. You just need to save the order numbers before dropping that column: `order_numbers = data['ORDER_NUMBER']`. The predictions `rf_predictions` are returned in the same order as the input data to `rf.predict(X_test)`, i.e. `rf_predictions[i]` belongs to `X_test.iloc[i]`.
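To see the index retention concretely, here is a small self-contained sketch with a toy frame (hypothetical data, not the order table):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(10), 'y': range(10, 20)})
X_train, X_test, y_train, y_test = train_test_split(
    df[['x']], df['y'], test_size=0.3, random_state=0)

# The split shuffles the rows but keeps the original index labels,
# so X_test.index tells you which source rows landed in the test set.
assert list(X_test.index) == list(y_test.index)
assert set(X_test.index).issubset(set(df.index))
```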
This creates your required result dataset:
```python
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
```
Btw, `data = data[1:]` doesn't remove the header -- it removes the first data row. `read_csv` already consumes the header line as column names, so there's no need to remove anything when you work with pandas dataframes.
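A quick way to convince yourself, using an in-memory CSV with two of the order rows:

```python
import io
import pandas as pd

csv = "ORDER_NUMBER,TOTAL_DAYS_TO_COMPLETE\n102203591,760\n102231010,706\n"
data = pd.read_csv(io.StringIO(csv))

# read_csv has already turned the header line into column names,
# so the frame holds only the two data rows...
assert len(data) == 2
# ...and data[1:] silently drops the first order, not a header:
assert len(data[1:]) == 1
assert data[1:].iloc[0]['ORDER_NUMBER'] == 102231010
```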
So the final program will be:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return importances.sort_values(by='Gini-importance', ascending=False)

def compare_values(arr1, arr2):
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediffs.append(abs(thing1 - thing2))
    return thediffs

def print_to_file(filepath, arr):
    with open(filepath, 'w') as f:
        for item in arr:
            f.write("%s\n" % item)

# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')

# create the labels, i.e. the field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)

# save the order numbers, then drop the column so it isn't used as a feature
order_numbers = data['ORDER_NUMBER']
data = data.drop('ORDER_NUMBER', axis=1)

# split into training and testing sets
# (random_state=1 fixed so the sample output is reproducible)
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=1)

rf = RandomForestRegressor(
    bootstrap=True,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)

importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)

res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
print(res)
```
With your example data from above we get (for `train_test_split` with `random_state=1`):
```
    ORDER_NUMBER  Predicted Value  Actual Value
3      102287793         652.0746           733
14     102599569         650.3984           425
19     102643207         319.4964           255
20     102656091         388.6004           356
26     102668177         475.1724           233
27     102669909         671.9158           244
32     102672513         319.1550           228
```
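As a side note, an equivalent way to assemble the result without the intermediate join is to look up the order numbers by the test index directly. A sketch with toy stand-ins (hypothetical values, not the real variables):

```python
import numpy as np
import pandas as pd

# toy stand-ins for order_numbers, y_test and rf_predictions above
order_numbers = pd.Series([101, 102, 103, 104], name='ORDER_NUMBER')
y_test = pd.Series([7, 155], index=[2, 0], name='Actual Value')
rf_predictions = np.array([34.496, 77.366])

# .loc with y_test.index pulls exactly the matching order numbers,
# since the test set keeps the original index labels
res = pd.DataFrame({
    'ORDER_NUMBER': order_numbers.loc[y_test.index].values,
    'Predicted Value': rf_predictions,
    'Actual Value': y_test.values,
}, index=y_test.index)
```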
Upvotes: 1