Sklearn Error: R^2 score is not well-defined with less than two samples

I built a data.csv to see how this works. Here it is:

s1;s2;s3;s4;result
1;2;3;4;5;15
2;1;3;1;2;9
19;21;0;0;0;40
11;9;0;1;5;26
5;5;5;5;5;25
80;1;1;1;1;84
1;2;3;1;1;8
1;0;0;1;1;3
10;10;10;10;20;60

As you can see, result is the sum of s1, s2, s3 and s4. Then I ran this code:

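import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model

data = pd.read_csv('example.csv', sep=';', index_col=0)
data = data[['s1', 's2', 's3', 's4', 'result']]
predict = 'result'

# features: every column except the target
x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])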

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)

acc = linear.score(x_test, y_test)
print(acc)

But I got this error.

C:\Users\Sharki\Anaconda3\lib\site-packages\sklearn\metrics\regression.py:543: UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
  warnings.warn(msg, UndefinedMetricWarning)

What's happening?

Upvotes: 0

Answers (1)

desertnaut

Reputation: 60318

The reason is that the test_size you pass to train_test_split is too low: with test_size=0.1 and only 9 data rows, the test set contains a single data point:

x_test, y_test
# (array([[2, 3, 4, 5]]), array([15]))

hence the error (actually a warning, since nan was returned).
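To see why a single test point leaves R^2 undefined, here is a minimal pure-Python sketch. The helpers n_test_samples and r2_by_hand are hypothetical names written for illustration, not scikit-learn functions; the sketch assumes a fractional test_size is rounded up to a whole number of rows, and it returns None where scikit-learn would warn and return nan.

```python
import math


def n_test_samples(n_samples, test_size):
    # Assumption: a fractional test_size is rounded up to whole rows.
    return math.ceil(test_size * n_samples)


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; None when the score is undefined."""
    if len(y_true) < 2:
        return None  # a single point has no spread around its own mean
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        return None  # simplification: constant targets are also treated as undefined
    return 1 - ss_res / ss_tot


# The question lists nine data rows.
one_row = n_test_samples(9, 0.1)    # test_size=0.1 leaves a single test row
two_rows = n_test_samples(9, 0.2)   # test_size=0.2 leaves two test rows

single = r2_by_hand([15], [14.2])        # made-up prediction for one point
pair = r2_by_hand([15, 9], [14, 10])     # made-up predictions for two points

print(one_row, two_rows, single, pair)
```

With one point, the total sum of squares around the mean is zero, so the ratio in the formula cannot be evaluated; two points are the minimum for which the score means anything.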

Change it to at least 0.2:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
[...]
print(acc)
# -88.65298209559413

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
[...]
print(acc)
# -0.5210446297916358

and prepare for wild fluctuations in your resulting R^2 (as already shown in the examples above), due to the extremely small size of your data.
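A rough illustration of why tiny test sets make R^2 so unstable: in the sketch below, the prediction errors are identical (off by one unit on each point), yet the score changes drastically depending on which two rows happen to land in the test set. The target values are taken from the result column in the question; the predictions and the helper r2_by_hand are made up for illustration.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for at least two non-constant targets."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Two test rows with nearly equal targets: tiny SS_tot, so errors dominate.
tight = r2_by_hand([8, 9], [9, 8])

# Two test rows with very different targets: huge SS_tot hides the same errors.
wide = r2_by_hand([3, 84], [4, 83])

print(tight, wide)
```

The first split scores far below zero and the second scores close to 1, even though the model is equally wrong in both cases.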

Additionally, notice that your CSV file actually has 6 fields per row but only 5 column names. As a result, pandas treats the first column as the index when reading the dataframe (notice that the leading 1 is missing from the x_test variable shown above). You should add an s5 to the header and remove index_col=0, i.e.:

data = pd.read_csv('example.csv', sep=';')

Visually inspecting your variables, especially in case of errors, always pays off.
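One way to catch this kind of header mismatch before pandas ever sees the file is a quick field-count check. The sketch below uses only the standard library csv module on the first rows from the question; find_width_mismatches is a hypothetical helper, not part of pandas or scikit-learn.

```python
import csv
import io

# First rows of the file from the question, inlined so the check is self-contained.
raw = """s1;s2;s3;s4;result
1;2;3;4;5;15
2;1;3;1;2;9
19;21;0;0;0;40
"""


def find_width_mismatches(text, delimiter=";"):
    """Return (row_number, field_count) for rows whose width differs from the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    return [
        (number, len(row))
        for number, row in enumerate(body, start=1)
        if len(row) != len(header)
    ]


header_width = len(raw.splitlines()[0].split(";"))
mismatches = find_width_mismatches(raw)

print(header_width)  # number of column names in the header
print(mismatches)    # every data row listed here is wider or narrower than the header
```

Any non-empty result means the header and the data disagree, which is exactly the situation in which pandas may silently use the first column as the index.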

Unrelated to your question, but naming the R^2 score acc (implying accuracy) is not good practice: R^2 and accuracy are different performance measures, and accuracy is meaningless in regression problems (it is only meaningful in classification).

Upvotes: 1