Large difference in performance on validation and test data

Question

I have scraped some data from Spotify to see if I can classify the music genre of different songs. I have split my data into a test set and a remaining set, which I have then further divided into a training and a validation set.

When I run the model (I try to classify between 112 genres) I get 30% accuracy in the validation set. Of course this is not great, but to be expected with 112 genres and limited data. What really confuses me is that when I apply the model to the test data, accuracy goes down to 1%.

I am not sure why that is: as far as I can see the validation and test data should be comparable. I train the model on the training data which should be completely independent.

I must be making some mistake, either allowing the model to peek into the validation data (hence the better performance there) or messing up my test data.

Or maybe applying the model twice messes things up?

Any idea what could be going on or how to debug it?

Thanks a lot! Franka


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# re-read data
track_df = pd.read_csv('track_df_corr.csv') 


features = [ 'acousticness', 'speechiness',
           'key', 'liveness', 'instrumentalness', 'energy', 'tempo',
            'loudness', 'danceability', 'valence',
           'duration_mins', 'year', 'genre']


track_df = track_df[features]

#First make a big split of all the data into test and train.
train, test = train_test_split(track_df, test_size=0.2, random_state = 0)

#Then create training and validation data set from the traindata.
# Read the data. Assign train and test data
# "full" is the data before preprocessing
X_full = train 
X_test_full = test 

# select to be predicted data
y = X_full.genre # just the target for the training data
y = pd.factorize(y)[0] # just keep the number - get rid of name by using [0] numbers needed for classifier
  
# Since we later want to evaluate the model on the test data, we also need a y_test.
# select to be predicted data
y_test = X_test_full.genre # just the target for the test data
y_test = pd.factorize(y_test)[0] # just keep the number - get rid of name by using [0]
                    # numbers needed for classifier


# remove to be predicted variable
X_full.drop(['genre'], axis=1, inplace=True) # rest of training free of target, which is now stored in y
X_test_full.drop(['genre'], axis=1, inplace=True) # not sure if necessary but cannot hurt


# Break off validation set from training data (X_full)
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
            train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
 
# General preprocessing steps: take care of categorical data (does not apply here).

categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]



# Keep selected columns only
my_cols = categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()



#Time to run the model.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below

# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median') 


# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR Pipeline of transforms with a "final estimator", here "categorical_transformer".
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
    transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to 
                        #be applied to subsets of the data.
        ('num', numerical_transformer, numerical_cols), 
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
# clf stands for classifier.
# Pipeline can be used to chain multiple estimators into one

# Preprocessing of training data, fit model 
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])


# Calling fit on the pipeline is the same as calling fit on each estimator in turn (here: preprocessor and model)
clf.fit(X_train, y_train)


# --------------------------------------------------------

#Test our model on the VALIDATION data

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

# Return the mean accuracy on the given test data and labels.
clf.score(X_valid, y_valid) # this is correct! 

# The code yields a value around 30%. 

# --------------------------------------------------------

#Apply our model on the TESTING data
# Preprocessing of test data, get predictions
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)

#The code yields a value around 1%.

yatu · Accepted Answer

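The problem that I see is that you're encoding the train and test labels using pd.factorize. pd.factorize assigns integer codes in order of first appearance, so calling it on y and y_test independently means the same genre can end up with a different number in each set, and the resulting encodings will not correspond to one another. Your validation labels are unaffected because y_valid is split off from y after it has been factorized, so it shares the training encoding. You want to use a LabelEncoder, so that when you fit the encoder using the train data, you then transform y_test using the same encoding scheme.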

Here's an example to illustrate this:

from sklearn.preprocessing import LabelEncoder

l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1,6,4])
# array([0, 2, 1], dtype=int64)

Here we get the correct encodings. However, if we apply pd.factorize to each list separately, pandas has no way of knowing which mapping was used before:

pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1,6,4])[0]
# array([0, 1, 2], dtype=int64)
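
Applied to the workflow in the question, this could look roughly like the sketch below. It is only an illustration under some assumptions: the small DataFrame is made-up toy data standing in for track_df, and the encoder is fitted on the whole genre column before splitting (rather than on the training labels only) so that a genre that happens to appear only in the test split does not make le.transform raise a ValueError.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for track_df (values are made up)
track_df = pd.DataFrame({
    "energy": [0.1, 0.9, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6],
    "tempo": [90, 140, 120, 128, 80, 150, 100, 110],
    "genre": ["jazz", "rock", "pop", "rock", "jazz", "pop", "jazz", "rock"],
})

# Fit ONE encoder on the full label column, so every split shares one mapping
le = LabelEncoder()
labels = le.fit_transform(track_df["genre"])
X = track_df.drop(columns=["genre"])

# Split features and already-encoded labels together
X_rest, X_test, y_rest, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

# The same encoder maps the integer codes back to genre names for any split
decoded_test = le.inverse_transform(y_test)
mapping = {name: int(code) for name, code in zip(le.classes_, le.transform(le.classes_))}
```

The further split of X_rest and y_rest into training and validation sets can stay as in the question, since both parts then inherit the same codes. Scikit-learn classifiers such as RandomForestClassifier also accept string labels directly, so skipping the encoding step altogether is another option.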
