Reputation: 121
I really can't figure out how to get around this error: array length 488 does not match index length 9914. I think it has something to do with how I'm defining my dataframes, but I can't find where the problem lies.
My code is:
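import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# train_df and test_df were loaded earlier from train.csv and test.csv
train_df.drop(['key','passenger_count','dropoff_longitude','dropoff_latitude','pickup_longitude','pickup_latitude','pickup_datetime'],axis=1,inplace=True)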
test_df.drop(['passenger_count','dropoff_longitude','dropoff_latitude','pickup_longitude','pickup_latitude','pickup_datetime'],axis=1,inplace=True)
train_df.dropna(how = 'any', axis = 'rows', inplace=True)
train_df.isnull().sum()
y = train_df.pop('fare_amount')
x = train_df
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)
param = {
    'max_depth': 5,
    'nthread': 4,
    'eval_metric': 'rmse',
    'min_child_weight': 1,
    'eta': 0.3
}
model = xgb.train(param, dtrain)
pred = model.predict(dtest, ntree_limit=model.best_ntree_limit)
submission = pd.DataFrame({"key": test_df["key"], "fare_amount": pred},
                          columns=['key', 'fare_amount'])
The error occurs on the last line, where submission is created, and the traceback looks like:
ValueError Traceback (most recent call last)
<ipython-input-193-1cb42e5ec957> in <module>()
57 pred = model.predict(dtest, ntree_limit=model.best_ntree_limit)
58 submission = pd.DataFrame({"key":test_df["key"], "fare_amount": pred},
ValueError: array length 488 does not match index length 9914
Both datasets start off with the same columns, except that test.csv doesn't have fare_amount. The shape of test.csv before I drop any columns is (9914, 8), whereas train.csv has shape (3034, 9).
Upvotes: 0
Views: 2662
Reputation: 121
To fix the problem, I added a new variable, x_pred = test_df.drop("key", axis=1), and then passed it to the prediction call: prediction = model.predict(xgb.DMatrix(x_pred), ntree_limit=model.best_ntree_limit)
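A minimal sketch of that fix, so the pairing of keys and predictions is explicit. To keep it runnable without a trained model, `predict_fn` is a hypothetical stand-in for something like `lambda X: model.predict(xgb.DMatrix(X))`, and the `distance_km` column and toy values are assumptions made up for illustration, not columns from the original data.

```python
import numpy as np
import pandas as pd


def build_submission(test_df, predict_fn):
    """Predict on the test features and pair each prediction with its key."""
    # "key" is an identifier, not a feature, so it is dropped before predicting.
    x_pred = test_df.drop("key", axis=1)
    pred = np.asarray(predict_fn(x_pred))

    # Guard against the original problem: one prediction per test row.
    if len(pred) != len(test_df):
        raise ValueError(
            f"got {len(pred)} predictions for {len(test_df)} test rows"
        )

    return pd.DataFrame(
        {"key": test_df["key"].to_numpy(), "fare_amount": pred},
        columns=["key", "fare_amount"],
    )


# Toy data (hypothetical): three test rows with a single feature.
test_df = pd.DataFrame(
    {"key": ["a", "b", "c"], "distance_km": [1.0, 2.5, 4.0]}
)

# Hypothetical stand-in for the trained model's predict call.
submission = build_submission(
    test_df, lambda X: 2.5 + 2.0 * X["distance_km"].to_numpy()
)
print(submission)
```

The length check is optional, but it turns a confusing constructor error into a message that says which side is the wrong size.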
Upvotes: 2
Reputation: 79
You are predicting on dtest, which is built from x_test, a subset of train_df, whereas test_df is a separate dataset. That is why pred has 488 rows while test_df["key"] has 9914.
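The mismatch can be reproduced on a small scale with pandas alone. This is only an illustrative sketch: the sizes (20 test rows, 5 predictions) and the key values are made up, standing in for the 9914 and 488 in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 20 "test" rows, but only 5 predictions,
# as if the predictions came from a hold-out split of the training data.
test_df = pd.DataFrame({"key": [f"row{i}" for i in range(20)]})
pred = np.zeros(5)

try:
    # A Series brings its own index (length 20); the bare array (length 5)
    # cannot be aligned to it, so the constructor raises ValueError.
    pd.DataFrame({"key": test_df["key"], "fare_amount": pred})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Comparing len(pred) with len(test_df) before building the DataFrame shows the same problem without needing the traceback.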
Even if test_df and pred had the same length, they would come from two different datasets, and linking them would be pointless unless test_df were the same data as train_df.
And even if the two datasets are similar, you need to apply the same DataFrame transformations to test_df as you did to train_df before linking them together.
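One way to keep the two transformations in sync is to put them in a single function and call it on both frames. The sketch below assumes the dropped columns listed in the question; the helper name `prepare_features`, the `distance_km` column, and the toy rows are hypothetical and only there to make the example runnable.

```python
import pandas as pd

# Columns the question drops from both datasets.
DROPPED = [
    "passenger_count",
    "dropoff_longitude",
    "dropoff_latitude",
    "pickup_longitude",
    "pickup_latitude",
    "pickup_datetime",
]


def prepare_features(df):
    """Return only the feature columns, using the same rules for any frame."""
    # errors="ignore" lets one call handle frames that lack some of these
    # columns (for example, test data has no "fare_amount").
    return df.drop(columns=DROPPED + ["key", "fare_amount"], errors="ignore")


# Toy frames (hypothetical values) with one surviving feature column.
train_df = pd.DataFrame(
    {
        "key": ["t1", "t2"],
        "fare_amount": [7.5, 12.0],
        "passenger_count": [1, 2],
        "distance_km": [1.2, 3.4],
    }
)
test_df = pd.DataFrame(
    {"key": ["a", "b", "c"], "passenger_count": [1, 1, 3], "distance_km": [0.8, 2.0, 5.1]}
)

x = prepare_features(train_df)
x_pred = prepare_features(test_df)

# The model should see the same columns, in the same order, at predict time.
print(list(x.columns), list(x_pred.columns))
```

Predictions made on x_pred then have one row per row of test_df, so they can be paired with test_df["key"] directly.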
Upvotes: 0