k191255 Sara Sameer

Reputation: 1

Inverse target variable scaling results in incorrect prediction results

I am working with an 'Automated Essay Grading' dataset that has multiple sets of essays, each with its own target score range. For example, set 01 has a score range of 2 to 12, set 02 has a range of 0 to 3, and so on. To normalize these different ranges for training, I used MinMaxScaler to scale all scores to between 0 and 1.
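For context, the scaling step looks roughly like this (a minimal sketch; df_train, df_test and the 'score' column are placeholders for my actual variable names):

from sklearn.preprocessing import MinMaxScaler

# Fit one scaler on the training scores and reuse it for the test scores
# ('score' and the dataframe names are placeholders for my actual setup)
scaler = MinMaxScaler(feature_range=(0, 1))
target_train = scaler.fit_transform(df_train[['score']])  # 2D input, shape (n_samples, 1)
target_test = scaler.transform(df_test[['score']])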

However, the predictions are much worse than expected, with a large gap between the predicted and target scores. I suspect the scaling process is the cause, because scaling the predictions back with the inverse_transform method does not produce accurate results. This is how I scale back:

test_loss, test_mae = lstm_model.evaluate([padded_essay_test, features_test], target_test)

# Make predictions on test data
predictions = lstm_model.predict([padded_essay_test, features_test])

# Scale the predictions back to the original score range
original_predictions = scaler.inverse_transform(predictions)
df_predictions = pd.DataFrame(original_predictions, columns=['original_predictions'])

# Scale the (scaled) target scores back as well, for comparison
scores_2d = [[score] for score in df_test['predicted_score']]
original_target = scaler.inverse_transform(scores_2d)
df_target = pd.DataFrame(original_target, columns=['original_target'])

print('Test Loss:', test_loss)
print('Test MAE:', test_mae)
print(df_predictions)
print(df_target)

Here are the results:

Test Loss: 0.16554389894008636
Test MAE: 0.32293763756752014
      original_predictions
0                 6.706153
1                 6.293279
2                 7.408381
3                 6.629674
4                 6.368900
...                    ...
4213             15.695969
4214             14.502607
4215             13.892921
4216             14.528075
4217             15.792664

[4218 rows x 1 columns]
      original_target
0                 7.0
1                 8.0
2                 9.0
3                 9.0
4                 9.0
...               ...
4213             33.0
4214             35.0
4215             38.0
4216             32.0
4217             39.0

[4218 rows x 1 columns]

I also tried scaling the predictions back manually for each essay set with the formula (predicted score * (max score - min score)) + min score instead of using inverse_transform, but that did not work either.

# Rescale each essay set's predictions with that set's own min/max
# (essay_set_ranges maps essay set name -> (min score, max score))
subset_predictions_list = []
for subset_name, subset_range in essay_set_ranges.items():
    subset_predictions = df_predictions.loc[df_test['essay_set'] == int(subset_name)]
    subset_min, subset_max = subset_range
    subset_predictions = subset_predictions * (subset_max - subset_min) + subset_min
    subset_predictions_list.append(subset_predictions)

df_subset_predictions = pd.concat(subset_predictions_list)
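On a single value the formula itself behaves as I expect (a minimal check, using set 01's range of 2 to 12 and a hypothetical normalized prediction of 0.5):

# Minimal check of the manual formula on one hypothetical value
subset_min, subset_max = 2, 12      # score range of essay set 01
normalized_prediction = 0.5         # hypothetical model output in [0, 1]
rescaled = normalized_prediction * (subset_max - subset_min) + subset_min
print(rescaled)                     # 7.0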

There are no outliers in the test data, and the test score range lies within the training score range. I have also looked at other StackOverflow posts about this, but none of the solutions suggested there fixes the problem described above.

Upvotes: 0

Views: 138

Answers (0)
