Reputation: 19
I was working through the toy Kaggle "Titanic" dataset, following along with the LinkedIn video course "Applied Machine Learning: Algorithms". When I ran the following code, this error occurred:
lr = LogisticRegression()
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7740/1127236825.py in <module>
5
6 cv = GridSearchCV(lr, parameters, cv=5)
7 cv.fit(tr_features, tr_labels.values.ravel())
8
9 print_results(cv)
~\AppData\Local\miniforge3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
63 return f(*args, **kwargs)
64
65 # extra_args > 0
~\AppData\Local\miniforge3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
757 refit_metric = self.refit
758
759 X, y, groups = indexable(X, y, groups)
760 fit_params = _check_fit_params(X, fit_params)
761
~\AppData\Local\miniforge3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
354 """
355 result = [_make_indexable(X) for X in iterables]
356 check_consistent_length(*result)
357 return result
358
~\AppData\Local\miniforge3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
317 uniques = np.unique(lengths)
318 if len(uniques) > 1:
319 raise ValueError("Found input variables with inconsistent numbers of"
320 " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [534, 535]
How can I resolve this?
Upvotes: 1
Views: 161
Reputation: 409
In the exercise file 02_04.ipynb (downloaded from the LinkedIn Learning course), notice the last line of the first code block:
tr_labels = pd.read_csv('../../../train_labels.csv', header=None)
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
tr_features = pd.read_csv('../../../train_features.csv')
tr_labels = pd.read_csv('../../../train_labels.csv', header=None)
Remove the header=None parameter from that last line. Your code block should now look like the following:
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)
tr_features = pd.read_csv('../../../train_features.csv')
tr_labels = pd.read_csv('../../../train_labels.csv')
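To see why that one parameter matters, here is a minimal sketch of the two read modes. Since the course's train_labels.csv is not reproduced here, I build a small stand-in CSV in memory with io.StringIO; the column name "Survived" is taken from the explanation below.

import io
import pandas as pd

# A stand-in for train_labels.csv: one header line plus three label rows.
csv_text = "Survived\n0\n1\n1\n"

# With header=None the word "Survived" is read as data,
# so the frame gains an extra row.
wrong = pd.read_csv(io.StringIO(csv_text), header=None)
print(len(wrong))   # 4 rows: the header line plus three labels

# With the default header='infer', the first line becomes the header.
right = pd.read_csv(io.StringIO(csv_text))
print(len(right))   # 3 rows, matching the number of label rows

With the real files this is exactly the 535-versus-534 mismatch: the labels frame picks up one extra row while the features frame does not.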
I found a community answer in the Q&A tab of the LinkedIn Learning course; below is the explanation, posted by one Maxwell Bauer:
It looks like the issue comes from how train_labels.csv is read. If the header parameter is set to None in pd.read_csv('train_labels.csv', header=None), the actual header of train_labels.csv is not read as a header; it is interpreted as the 0th row of data instead. Removing the header parameter, so the call becomes pd.read_csv('train_labels.csv'), solves the problem because the function falls back to its default header handling ('infer'), which treats the first line of train_labels.csv as the header. Here is the proposed solution:
tr_features = pd.read_csv('train_features.csv')
tr_labels = pd.read_csv('train_labels.csv')
For what it's worth, I diagnosed the issue by looking at how the CSVs were being handled: I read the files using the above solution, ran tr_features in one new cell, and ran tr_labels in another. Running them in separate cells shows how pandas interprets each CSV, and I could see that the CSV headers were being treated as headers with the proposed solution. If you then add header=None back into tr_labels = pd.read_csv('train_labels.csv') and run tr_labels again, you can see that the header 'Survived' has moved into the 0th row instead of the table header. That extends the tr_labels column vector by one row (giving it 535 rows) while the tr_features matrix has only 534 rows, which creates the inconsistency. Let me know if this solution worked for you.
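The length check that raises the ValueError in the traceback can also be reproduced directly with scikit-learn's own validator. This is just a sketch: the zero arrays below stand in for the 534-row feature matrix and the mis-read 535-row label frame.

import numpy as np
from sklearn.utils.validation import check_consistent_length

features = np.zeros((534, 3))   # stand-in for tr_features
labels = np.zeros(535)          # stand-in for the mis-read tr_labels

try:
    check_consistent_length(features, labels)
    msg = None
except ValueError as e:
    msg = str(e)

print(msg)   # Found input variables with inconsistent numbers of samples: [534, 535]

GridSearchCV.fit calls this same check on X and y before doing anything else, which is why the mismatch surfaces there rather than during training.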
Upvotes: 0