Reputation: 1
I am working with an Automated Essay Grading dataset that contains multiple sets of essays, each with its own target score range. For example, set 01 has a score range of 2 to 12, set 02 has a range of 0 to 3, and so on. To normalize these ranges for training, I used MinMaxScaler to scale the scores to between 0 and 1.
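For context, the scaling step looks roughly like this (a minimal sketch; the exact fitting code is not shown here, and df_train['score'] is a stand-in for the raw score column):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assumed setup: a single scaler is fit on ALL training scores at once,
# regardless of which essay set each score belongs to.
scaler = MinMaxScaler(feature_range=(0, 1))
target_train = scaler.fit_transform(df_train[['score']])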
However, the predictions are much worse than expected, with a large gap between the predicted and target scores. I suspect the scaling process, because scaling the predictions back with the inverse_transform method does not produce values in the expected ranges. This is how I scale back:
# Evaluate the model on the (scaled) test targets
test_loss, test_mae = lstm_model.evaluate([padded_essay_test, features_test], target_test)

# Make predictions on the test data
predictions = lstm_model.predict([padded_essay_test, features_test])

# Scale the predictions back to the original score range
original_predictions = scaler.inverse_transform(predictions)
df_predictions = pd.DataFrame(original_predictions, columns=['original_predictions'])

# Scale the (still scaled) test targets back the same way
scores_2d = [[score] for score in df_test['predicted_score']]
original_target = scaler.inverse_transform(scores_2d)
df_target = pd.DataFrame(original_target, columns=['original_target'])

print('Test Loss:', test_loss)
print('Test MAE:', test_mae)
print(df_predictions)
print(df_target)
Here are the results:
Test Loss: 0.16554389894008636
Test MAE: 0.32293763756752014
original_predictions
0 6.706153
1 6.293279
2 7.408381
3 6.629674
4 6.368900
... ...
4213 15.695969
4214 14.502607
4215 13.892921
4216 14.528075
4217 15.792664
[4218 rows x 1 columns]
original_target
0 7.0
1 8.0
2 9.0
3 9.0
4 9.0
... ...
4213 33.0
4214 35.0
4215 38.0
4216 32.0
4217 39.0
[4218 rows x 1 columns]
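To check whether inverse_transform itself is at fault, one can round-trip a known score through the scaler (a minimal sketch, assuming the single global scaler sketched above):
import numpy as np

# Round-trip a known raw score: if the scaler is fit correctly,
# transform followed by inverse_transform should return 9.0 unchanged.
raw = np.array([[9.0]])
print(scaler.inverse_transform(scaler.transform(raw)))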
I also tried scaling the predictions back per essay set with the formula (predicted score * (max score - min score)) + min score instead of using inverse_transform, but that did not work either:
subset_predictions_list = []
for subset_name, subset_range in essay_set_ranges.items():
    # Select the predictions that belong to this essay set
    subset_predictions = df_predictions.loc[df_test['essay_set'] == int(subset_name)]
    # Undo the min-max scaling manually using this set's score range
    subset_min, subset_max = subset_range
    subset_predictions = subset_predictions * (subset_max - subset_min) + subset_min
    subset_predictions_list.append(subset_predictions)
df_subset_predictions = pd.concat(subset_predictions_list)
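For reference, the loop above should be equivalent to keeping one fitted MinMaxScaler per essay set and calling its inverse_transform (a sketch only; per_set_scalers is a hypothetical dict mapping each set id to a scaler fit on that set's raw training scores):
# Hypothetical: per_set_scalers maps essay_set id -> MinMaxScaler fit on
# that set's raw scores, so inverse_transform undoes exactly
# (score * (max - min)) + min for the matching set.
subset_predictions_list = []
for subset_name, subset_scaler in per_set_scalers.items():
    mask = df_test['essay_set'] == int(subset_name)
    inv = subset_scaler.inverse_transform(df_predictions.loc[mask])
    subset_predictions_list.append(
        pd.DataFrame(inv, index=df_predictions.index[mask],
                     columns=['original_predictions']))
df_subset_predictions = pd.concat(subset_predictions_list)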
There are no outliers in the test data, and the test score range lies within the training score range. I have also checked other Stack Overflow posts on this problem, but none of the suggested solutions fixes it.
Upvotes: 0
Views: 138