Stefano Potter

Reputation: 3577

Difference in cross-validated results between R and Python

I have a data frame like so:

    log.comb  CDEM_TWI  Gruber_Ruggedness      dNBR  TC_Change_Sexton_Rel  \
0   8.714914  10.70240           0.626106  0.701591             -27.12220   
1   6.501334  10.65650           1.146360  0.693891             -35.52890   
2   8.946111  13.58910           1.146360  0.513136               7.00000   
3   8.955151   9.85036           1.126980  0.673891              13.81380   
4   7.751379   7.28264           0.000000  0.256136              10.06940   
5   8.895197   8.36555           0.000000  0.506000             -27.61340   
6   8.676571  12.92650           0.000000  0.600627             -44.48400   
7   8.562267  12.76980           0.519255  0.747009             -29.84790   
8   9.052766  11.81580           0.519255  0.808336             -29.00900   
9   9.133744   9.42046           0.484616  0.604891             -18.53550   
10  8.221441   9.53682           0.484616  0.817336             -21.39920   
11  8.398913  12.32050           0.519255  0.814745             -18.12080   
12  7.587468  11.08880           1.274430  0.590282              92.85710   
13  7.983136   8.95073           1.274430  0.316000             -10.34480   
14  9.044404  11.18440           0.698818  0.608600             -14.77000   
15  8.370293  11.96980           0.687634  0.323000              -9.60452   
16  7.938134  12.42380           0.709549  0.374027              36.53140   
17  8.183456  12.73490           1.439180  0.679627             -12.94420   
18  8.322246   9.61600           0.551689  0.642900              37.50000   
19  7.934997   7.77564           0.519255  0.690936             -25.29880   
20  9.049387  11.16000           0.519255  0.789064             -35.73880   
21  8.071323   6.17036           0.432980  0.574355             -22.43590   
22  6.418345   5.98927           0.432980  0.584991               4.34783   
23  7.950516   5.49527           0.422882  0.689009              25.22520   
24  6.355529   7.35982           0.432980  0.419045             -18.81920   
25  8.043683   5.18300           0.763596  0.582555              50.56180   
26  6.013468   5.34018           0.493781  0.241155              -3.01205   
27  7.961675   5.43264           0.493781  0.421527             -21.72290   
28  8.074614  11.94630           0.493781  0.451800              11.61620   
29  8.370570   6.34100           0.492384  0.550127             -12.50000   

    Pct_Pima  Sand._15cm  
0   75.62120     44.6667  
1   69.30690     41.8333  
2   59.47490     41.8333  
3   66.08800     41.5000  
4   34.31250     39.6667  
5   35.04750     39.2424  
6   62.32120     41.6667  
7   57.14320     43.3333  
8   57.35020     43.3333  
9   72.90980     41.0000  
10  57.61790     38.8333  
11  57.35020     39.8333  
12  69.30690     47.8333  
13  69.30690     47.3333  
14  76.58910     42.8333  
15  75.62120     45.3333  
16  76.69440     41.7727  
17  59.47090     37.8333  
18  61.10130     42.8333  
19  72.67650     38.1818  
20  57.35020     40.6667  
21  23.15380     48.0000  
22  17.15050     51.5000  
23   0.00000     47.5000  
24   6.67001     58.0000  
25  15.18050     54.8333  
26   5.89344     49.0000  
27   5.89344     49.1667  
28  13.18900     48.5000  
29  13.30450     49.0000 

I want to run a linear model through 10-fold cross-validation repeated 10 times (100 fits in total).

In python I do this:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import RepeatedKFold
    from sklearn.metrics import r2_score

    X = df[['CDEM_TWI', 'Gruber_Ruggedness', 'dNBR', 'TC_Change_Sexton_Rel', 'Pct_Pima', 'Sand._15cm']].copy()
    y = df[['log.comb']].copy()

    all_r2 = []
    rskf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)
    for train_index, test_index in rskf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        lm = LinearRegression(fit_intercept=True)
        lm.fit(X_train, y_train)
        pred = lm.predict(X_test)
        r2 = r2_score(y_test, pred)
        all_r2.append(r2)

    avg = np.mean(all_r2)
and here avg returns -0.11
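One thing I noticed while debugging: with 30 rows and 10 folds, each test fold holds only 3 points, so `r2_score` computed per fold is extremely noisy and can easily go negative. As a sanity check (on synthetic data of the same shape, since the real `df` isn't reproducible here), pooling all out-of-fold predictions before scoring gives a much more stable number than averaging per-fold scores:

```python
import numpy as np
from sklearn.datasets import make_regression   # stand-in for the real df
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# 30 rows / 6 predictors, mirroring the shape of the data above
X, y = make_regression(n_samples=30, n_features=6, noise=10.0, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Per-fold r2 on 3-point test sets: high variance, can dip negative
fold_scores = []
for train_idx, test_idx in cv.split(X):
    lm = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(r2_score(y[test_idx], lm.predict(X[test_idx])))

# Pooling all out-of-fold predictions before scoring is far more stable
pred = cross_val_predict(LinearRegression(), X, y, cv=cv)
pooled_r2 = r2_score(y, pred)
print(np.mean(fold_scores), pooled_r2)
```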

In R I do this:

    library(caret)
    library(klaR)

    train_control <- trainControl(method="repeatedcv", number=10, repeats=10)
    model <- train(log.comb~., data=df, trControl=train_control, method="lm")

and model returns:

    RMSE       Rsquared   MAE      
    0.7868838  0.6132806  0.7047198

Why are these results so inconsistent with each other? I realize the folds differ between the two languages, but since I am repeating the procedure so many times I don't understand why the numbers aren't more similar.
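Part of the gap may also be definitional: caret's default summary reports Rsquared as the squared Pearson correlation between observed values and predictions, while sklearn's `r2_score` is the coefficient of determination. The two can disagree badly, e.g. for predictions that are perfectly correlated with the observations but biased (toy numbers below, not from my data):

```python
import numpy as np
from sklearn.metrics import r2_score

obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = obs + 2.0   # perfectly correlated, but shifted by a constant bias

cod = r2_score(obs, pred)               # coefficient of determination: negative here
corr_sq = np.corrcoef(obs, pred)[0, 1] ** 2  # caret-style squared correlation: 1.0
print(cod, corr_sq)
```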

I also tried a nested grid search in sklearn like so:

    from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

    inner_cv = KFold(n_splits=10, shuffle=True, random_state=10)
    outer_cv = KFold(n_splits=10, shuffle=True, random_state=10)

    param_grid = {'fit_intercept': [True, False],
                  'normalize': [True, False]}

    # Non-nested parameter search and scoring
    clf = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=inner_cv)
    clf.fit(X, y)
    non_nested_score = clf.best_score_

    # Pass the GridSearchCV estimator to cross_val_score
    clf = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=inner_cv)
    nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv).mean()

but both the nested_score and the non_nested_score are still negative.
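Since LinearRegression has almost nothing to tune (the grid is just fit_intercept/normalize), the nested search shouldn't behave much differently from plain cross-validation. To line up with the RMSE and MAE columns that caret reports, the same resampling can also be scored on those metrics directly (sketch on synthetic stand-in data):

```python
from sklearn.datasets import make_regression   # stand-in for the real df
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=30, n_features=6, noise=10.0, random_state=0)
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)

# Negate the "neg_*" scorers to get positive error values, as caret reports them
rmse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                        scoring="neg_root_mean_squared_error")
mae = -cross_val_score(LinearRegression(), X, y, cv=cv,
                       scoring="neg_mean_absolute_error")
print(rmse.mean(), mae.mean())
```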

Upvotes: 1

Views: 104

Answers (1)

Carlos Santillan

Reputation: 1087

The Python code is returning the average of the results across all folds, while the R code is reporting the best model found.

Upvotes: 1
