dejsdukes

Reputation: 121

Array Length Not Matching Index Length

I can't figure out how to get around this error: "array length 488 does not match index length 9914". I think it has something to do with how I'm defining my dataframes, but I can't find where the problem lies.

My code is:

train_df.drop(['key','passenger_count','dropoff_longitude','dropoff_latitude','pickup_longitude','pickup_latitude','pickup_datetime'],axis=1,inplace=True)
test_df.drop(['passenger_count','dropoff_longitude','dropoff_latitude','pickup_longitude','pickup_latitude','pickup_datetime'],axis=1,inplace=True)

train_df.dropna(how = 'any', axis = 'rows', inplace=True)
train_df.isnull().sum()

y = train_df.pop('fare_amount')
x = train_df
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)
param = {
    'max_depth':5,
    'nthread':4,
    'eval_metric': 'rmse',
    'min_child_weight': 1,
    'eta':0.3
}
model = xgb.train(param, dtrain)
pred = model.predict(dtest, ntree_limit=model.best_ntree_limit)
submission = pd.DataFrame({"key":test_df["key"], "fare_amount": pred},
                         columns = ['key', 'fare_amount'])

The error occurs on the last line, where submission is created, and the traceback looks like this:

ValueError                                Traceback (most recent call last)
<ipython-input-193-1cb42e5ec957> in <module>()
     57 pred = model.predict(dtest, ntree_limit=model.best_ntree_limit)
     58 submission = pd.DataFrame({"key":test_df["key"], "fare_amount": pred},

ValueError: array length 488 does not match index length 9914

Both datasets start off with the same columns, except that test.csv doesn't have fare_amount.

The shape of test.csv before I drop any columns is (9914, 8), whereas train.csv is (3034, 9).

Upvotes: 0

Views: 2662

Answers (2)

dejsdukes

Reputation: 121

To fix the problem, I created a new variable x_pred = test_df.drop("key", axis=1) and passed it to the prediction call: prediction = model.predict(xgb.DMatrix(x_pred), ntree_limit=model.best_ntree_limit). This yields one prediction for each row of test_df.
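For anyone who wants to see the shape logic in isolation, here is a minimal sketch of that fix on toy data. It makes some assumptions: the "distance" column and the predict_fare function are made up for illustration, and predict_fare only stands in for model.predict(xgb.DMatrix(x_pred)) so the snippet runs without a trained booster.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test.csv after the drop() call in the question:
# "key" plus one feature column. The name "distance" is hypothetical.
test_df = pd.DataFrame({
    "key": ["a", "b", "c", "d"],
    "distance": [1.0, 2.5, 0.7, 4.2],
})


def predict_fare(features: pd.DataFrame) -> np.ndarray:
    """Hypothetical placeholder for model.predict(xgb.DMatrix(features))."""
    return 2.5 + 1.8 * features["distance"].to_numpy()


# Features for prediction: everything in test_df except the identifier.
x_pred = test_df.drop("key", axis=1)
prediction = predict_fare(x_pred)

# One prediction per row of test_df, so the lengths agree and the frame builds.
submission = pd.DataFrame(
    {"key": test_df["key"], "fare_amount": prediction},
    columns=["key", "fare_amount"],
)
```

The point is that the predictions and the keys both come from test_df, so they have the same number of rows and are in the same order.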

Upvotes: 2

Tom Antony

Reputation: 79

You are predicting on dtest, which is a subset of train_df, whereas test_df is a separate dataset.

Even if test_df and pred had the same length, they come from two different datasets, and linking them is pointless unless test_df = train_df.

And even if they are similar, you need to apply the same DataFrame transformations to test_df as you did to train_df before linking them together.
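To illustrate the mismatch, here is a small sketch with toy data. The sizes (10 keys, 3 predictions) and the build_submission helper are hypothetical and only stand in for the 9914 keys and 488 predictions in the question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 10 keys (like test_df["key"]) but only 3 predictions
# (like pred computed on the smaller validation split dtest).
keys = pd.Series([f"row{i}" for i in range(10)], name="key")
pred = np.zeros(3)

try:
    pd.DataFrame({"key": keys, "fare_amount": pred})
    mismatch_raised = False
except ValueError:
    # pandas refuses to align a plain array with a Series of another length.
    mismatch_raised = True


def build_submission(keys: pd.Series, pred: np.ndarray) -> pd.DataFrame:
    """Hypothetical helper: fail early with a clear message on a mismatch."""
    if len(pred) != len(keys):
        raise ValueError(
            f"got {len(pred)} predictions for {len(keys)} keys; "
            "predict on features built from the same frame as the keys"
        )
    return pd.DataFrame({"key": keys.to_numpy(), "fare_amount": pred})


# With one prediction per key, the frame builds as expected.
submission = build_submission(keys, np.zeros(len(keys)))
```

A length check like this only catches the size mismatch. It cannot tell you whether the predictions really belong to those keys, which is why the features have to be built from test_df itself.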

Upvotes: 0
