Reputation: 357
I also tried to reshape both X (8889, 17) and y (8889, 1), but it didn't help at all:
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors, model_selection
songs_dataset = pd.read_json('MasterSongList.json')
songs_dataset.loc[:,'genres'] = songs_dataset['genres'].apply(''.join)
def consolidateGenre(genre):
    if len(genre) > 0:
        return genre.split(':')[0]
    else:
        return genre
songs_dataset.loc[:, 'genres'] = songs_dataset['genres'].apply(consolidateGenre)
audio_feature_list = [audio_feature for audio_feature in songs_dataset['audio_features']]
audio_features_headers = ['key','energy','liveliness','tempo','speechiness','acousticness','instrumentalness','time_signature',
                          'duration','loudness','valence','danceability','mode','time_signature_confidence','tempo_confidence',
                          'key_confidence','mode_confidence']
audio_features = pd.DataFrame(audio_feature_list, columns=audio_features_headers)
audio_features.loc[:,].dropna(axis=0,how='all',inplace=True)
audio_features['genres'] = songs_dataset['genres']
rock_rap = audio_features.loc[(audio_features['genres'] == 'rock') | (audio_features['genres'] == 'rap')]
rock_rap = rock_rap.reset_index(drop=True)
label_genres = np.array(rock_rap['genres']).reshape(-1, 1)
final_features = rock_rap.drop('genres',axis = 1).astype(float)
final_features['speechiness'].fillna(final_features['speechiness'].mean(),inplace=True)
knn = neighbors.KNeighborsClassifier(n_neighbors = 3)
standard_scaler = preprocessing.StandardScaler()
final_features = standard_scaler.fit_transform(final_features)
X_train, y_train, X_test, y_test = cross_validation.train_test_split(final_features,label_genres,test_size=0.2)
knn.fit(X_train,y_train)
ValueError: Found input variables with inconsistent numbers of samples: [7111, 1778]
Upvotes: 0
Views: 1867
Reputation: 8187
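Your problem is that you're assigning the results of train_test_split in the wrong order. The function returns X_train, X_test, y_train, y_test, so the variable you named y_train actually holds X_test, and you end up fitting the model on X_train and X_test instead of the training features and training labels you intended. Use this instead: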
X_train, X_test, y_train, y_test = cross_validation.train_test_split(final_features,label_genres,test_size=0.2)
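To see the return order for yourself, here is a minimal sketch. It uses random synthetic data with the same shape as in the question (8889 rows, 17 features) as a stand-in for the song dataset, which is an assumption for illustration only. It also imports train_test_split from sklearn.model_selection, since the cross_validation module was removed in later scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(8889, 17))
y = rng.choice(["rock", "rap"], size=8889)

# Wrong order: the second name receives X_test, not y_train.
X_tr, y_tr, X_te, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, y_tr.shape)  # (7111, 17) (1778, 17) -> mismatched sample counts

# Correct order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape)  # (7111, 17) (7111,)

# With matching sample counts, fitting works.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```

Printing the shapes right after the split is a cheap way to catch this kind of mix-up before calling fit.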
Incidentally, the sample counts in the error message are a hint: 7111 is almost exactly four times 1778 (0.8 / 0.2 = 4), which is what you'd expect from a training set and a test set, not from features and labels.
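The two numbers can be reproduced from the row count given in the question. This is a small arithmetic sketch that assumes the test count is rounded up and the training set takes the remainder, which matches the figures in the error message.

```python
import math

n_samples = 8889   # rows in X, from the question
test_size = 0.2

# Assumption: test count rounds up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)             # 7111 1778
print(round(n_train / n_test, 2))  # 4.0
```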
Upvotes: 3