LThomas

Reputation: 11

MinMaxScaler not scaling correctly

I am using sklearn MinMaxScaler code I got from Lynda.com to scale my data sets for a prediction script. The feature range should be (0, 1), but I noticed that in my scaled trial data some columns contain values larger than 1. I believe this is causing my predictions to come out wrong. Can anybody help? Below is the code I am using:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load training data set from CSV file
training_data_df = pd.read_csv("10596_data_training.csv")

# Load testing data set from CSV file
test_data_df = pd.read_csv("10596_data_test.csv")

# Load the trial data set from CSV file
trial_data_df = pd.read_csv("day05.csv")

# Data needs to be scaled to a small range like 0 to 1 for the neural
# network to work well.
scaler = MinMaxScaler(feature_range=(0, 1))

# Scale both the training inputs and outputs
scaled_training = scaler.fit_transform(training_data_df)
scaled_testing = scaler.transform(test_data_df)
scaled_trial = scaler.transform(trial_data_df)

# Print out the adjustment that the scaler applied to the total_hours column (index 40)
print("Note: total_hours values were scaled by multiplying by {:.10f} and adding {:.6f}".format(scaler.scale_[40], scaler.min_[40]))

# Create new pandas DataFrame objects from the scaled data
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
scaled_testing_df = pd.DataFrame(scaled_testing, columns=test_data_df.columns.values)
scaled_trial_df = pd.DataFrame(scaled_trial, columns=trial_data_df.columns.values)

# Save scaled data dataframes to new CSV files
scaled_training_df.to_csv("10596_data_training_scaled.csv", index=False)
scaled_testing_df.to_csv("10596_data_test_scaled.csv", index=False)
scaled_trial_df.to_csv("day05_scaled.csv", index=False)

Upvotes: 1

Views: 3388

Answers (1)

Tim

Reputation: 2843

You're fitting ("training") your MinMaxScaler on one subset of your data and then using it to transform different subsets. For each column, the MinMaxScaler simply subtracts the minimum of the training set and then divides by the range of the training set (max minus min). If the trial set has values greater than the max of the training set or less than the min of the training set, you'll have values outside of the [0,1] range. This is expected and acceptable.
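To make the arithmetic concrete, here is a minimal pure-Python sketch of the same "multiply by a scale, then add an offset" calculation that the question's print statement reports via scaler.scale_ and scaler.min_. The helper names fit_min_max and transform and the sample numbers are hypothetical, chosen only for illustration. This is not scikit-learn's implementation, and it assumes the training column is not constant.

```python
def fit_min_max(values, feature_range=(0.0, 1.0)):
    """Learn the scale and offset from the training values only."""
    lo, hi = feature_range
    data_min, data_max = min(values), max(values)
    # Assumes data_max != data_min (a non-constant column).
    scale = (hi - lo) / (data_max - data_min)
    offset = lo - data_min * scale
    return scale, offset


def transform(values, scale, offset):
    """Apply the previously learned scale and offset to any data."""
    return [v * scale + offset for v in values]


# Hypothetical single-column data: the trial set goes beyond the training range.
training = [2.0, 6.0, 10.0]
trial = [0.0, 8.0, 14.0]

scale, offset = fit_min_max(training)  # scale = 0.125, offset = -0.25
scaled_training = transform(training, scale, offset)
scaled_trial = transform(trial, scale, offset)

print(scaled_training)  # [0.0, 0.5, 1.0]
print(scaled_trial)     # [-0.25, 0.75, 1.5]
```

The training values land exactly in [0, 1], while the trial values 0.0 and 14.0 fall outside the training min and max and so map to -0.25 and 1.5. If out-of-range values really are a problem for the downstream model, scikit-learn 0.24 and later accept MinMaxScaler(clip=True), which clips transformed values to the feature range. Whether clipping is appropriate depends on the data.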

Upvotes: 5
