Reputation: 69
I built a small data.csv to see how this works. This is the file:
s1;s2;s3;s4;result
1;2;3;4;5;15
2;1;3;1;2;9
19;21;0;0;0;40
11;9;0;1;5;26
5;5;5;5;5;25
80;1;1;1;1;84
1;2;3;1;1;8
1;0;0;1;1;3
10;10;10;10;20;60
As you can see, result is the sum of s1, s2, s3 and s4.
So I did this.
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model

data = pd.read_csv('example.csv', sep=';', index_col=0)
data = data[['s1', 's2', 's3', 's4', 'result']]
predict = 'result'
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)
But I got this error.
C:\Users\Sharki\Anaconda3\lib\site-packages\sklearn\metrics\regression.py:543: UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
  warnings.warn(msg, UndefinedMetricWarning)
What's happening?
Upvotes: 0
Views: 3669
Reputation: 60318
The reason is that you have asked for too small a test_size in your train_test_split; test_size=0.1, in your dataset of only 9 data rows, corresponds to a single data point in your test set:
x_test, y_test
# (array([[2, 3, 4, 5]]), array([15]))
hence the error (actually a warning, since nan was returned).
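To see why a single test sample cannot give a usable R^2, here is a minimal sketch in plain Python. The helper names n_test_samples and r_squared are made up for illustration and are not scikit-learn functions; the rounding-up of a fractional test_size is my understanding of how train_test_split behaves, and the prediction 14.2 is an arbitrary made-up value. R^2 is 1 - SS_res / SS_tot, and with one sample SS_tot (the spread of y around its own mean) is zero.

```python
import math

def n_test_samples(n_samples, test_size):
    # Assumption: a fractional test_size is rounded up to a whole sample count.
    return math.ceil(test_size * n_samples)

def r_squared(y_true, y_pred):
    # Hand-rolled R^2 = 1 - SS_res / SS_tot (illustrative, not sklearn's code).
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        # A single sample has no spread around its mean, so R^2 is undefined.
        return float("nan")
    return 1 - ss_res / ss_tot

n_rows = 9  # number of data rows in the CSV shown in the question
print(n_test_samples(n_rows, 0.1))  # 1
print(n_test_samples(n_rows, 0.2))  # 2
print(r_squared([15], [14.2]))      # nan
```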
Change it to at least 0.2:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
[...]
print(acc)
# -88.65298209559413
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
[...]
print(acc)
# -0.5210446297916358
and prepare for wild fluctuations in your resulting R^2 (as already shown in the examples above), due to the extremely small size of your data.
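The large negative values are not a bug: R^2 goes below zero whenever the model predicts worse than simply guessing the mean of the test targets, and with two or three test points a single bad prediction dominates. A rough sketch with made-up numbers (neither the targets nor the predictions come from your data, and r_squared is a hand-written helper, not a library call):

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, written out by hand for illustration.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10, 12]         # hypothetical two-sample test set
close = [10.5, 11.5]      # predictions near the targets
far_off = [0, 22]         # predictions far from the targets

print(r_squared(y_true, close))    # 0.75
print(r_squared(y_true, far_off))  # -99.0
```

Because SS_tot is tiny when the test targets happen to lie close together, the same absolute error can swing the score from near 1 to far below 0 from one random split to the next.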
Additionally, notice that each row of your CSV file actually has 6 fields but the header has only 5 column names, so pandas treats the first column as the index when reading the dataframe (notice that 1 is missing from the x_test variable shown above). You should add an s5 to the header and remove index_col=0, i.e.:
data = pd.read_csv('example.csv', sep=';')
Visually inspecting your variables, especially in case of errors, always pays off.
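One quick way to do such an inspection, sketched here with only the standard library: compare the number of header names with the number of fields in each row. The string below holds the header and first three rows from your file; the variable names are just for illustration.

```python
import csv
import io

# Header and first three data rows copied from the CSV in the question.
raw = """s1;s2;s3;s4;result
1;2;3;4;5;15
2;1;3;1;2;9
19;21;0;0;0;40
"""

rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
header, records = rows[0], rows[1:]

# Rows whose field count differs from the number of column names.
mismatched = [r for r in records if len(r) != len(header)]

print(len(header))                 # 5
print({len(r) for r in records})   # {6}
print(len(mismatched))             # 3
```

Any non-empty mismatched list means the header and the data disagree, which is worth fixing before fitting anything.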
Irrelevant to your question, but naming the R^2 score acc (implying accuracy) is not good practice, since R^2 and accuracy are different performance measures, and accuracy is meaningless in regression problems (it is only meaningful in classification).
Upvotes: 1