Victor

Reputation: 19

"Found input variables with inconsistent numbers of samples" error when processing the Kaggle "Titanic" dataset

I was training on the toy Kaggle "Titanic" dataset, following the LinkedIn Learning video course "Applied Machine Learning: Algorithms". When I ran the following code, this error occurred:

lr = LogisticRegression()
parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7740/1127236825.py in <module>
      5 
      6 cv = GridSearchCV(lr, parameters, cv=5)
      7 cv.fit(tr_features, tr_labels.values.ravel())
      8 
      9 print_results(cv)

~\AppData\Local\miniforge3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
     63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~\AppData\Local\miniforge3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    757             refit_metric = self.refit
    758 
    759         X, y, groups = indexable(X, y, groups)
    760         fit_params = _check_fit_params(X, fit_params)
    761 

~\AppData\Local\miniforge3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    354     """
    355     result = [_make_indexable(X) for X in iterables]
    356     check_consistent_length(*result)
    357     return result
    358 

~\AppData\Local\miniforge3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    317     uniques = np.unique(lengths)
    318     if len(uniques) > 1:
    319         raise ValueError("Found input variables with inconsistent numbers of"
    320                          " samples: %r" % [int(l) for l in lengths])

ValueError: Found input variables with inconsistent numbers of samples: [534, 535]

How do I resolve this?

Upvotes: 1

Views: 161

Answers (1)

aaronkelton

Reputation: 409

Problem Code

From the exercise file 02_04.ipynb (downloaded with the LinkedIn Learning course materials), here is the first code block. Notice the last line, which passes header=None:

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('../../../train_features.csv')
tr_labels = pd.read_csv('../../../train_labels.csv', header=None)
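
With header=None in place you can see the mismatch directly, before GridSearchCV ever runs. A quick check (hypothetical, assuming the frames loaded above):

print(tr_features.shape)  # 534 rows of features
print(tr_labels.shape)    # 535 rows: the 'Survived' header has been read as an extra data row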

Solution

Remove the header=None parameter from the last line. Your code block should now look like the following:

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('../../../train_features.csv')
tr_labels = pd.read_csv('../../../train_labels.csv')
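
After removing header=None, both frames should report the same number of rows. A minimal sanity check (assuming the same variable names) before fitting:

print(len(tr_features), len(tr_labels))  # both should now be 534
assert len(tr_features) == len(tr_labels), 'row counts still differ'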

Source

I found a community answer in the Q&A tab of the LinkedIn Learning course. Below the screenshot I have pasted the explanation, written by Maxwell Bauer:

[screenshot of the Q&A tab from the LinkedIn Learning course]

It looks like the issue comes from the reading of train_labels.csv. If the header parameter is set to None in pd.read_csv('train_labels.csv', header=None), then the actual headers of train_labels.csv are not read as headers and are instead interpreted as the 0th row. If you remove the header parameter and call pd.read_csv('train_labels.csv'), the problem goes away because the function falls back to its default header handling ('infer'), so the headers of train_labels.csv are interpreted as headers. Here is the proposed solution:

tr_features = pd.read_csv('train_features.csv')
tr_labels = pd.read_csv('train_labels.csv')

For what it's worth, I diagnosed the issue by looking at how the CSVs were being handled: I read the files using the above solution, ran tr_features in a new cell, and then ran tr_labels in another new cell. Running tr_features and tr_labels in different cells shows how the program interprets the CSVs, and with the proposed solution I could see that the CSVs' headers were being interpreted as headers. If you then add header=None back into tr_labels = pd.read_csv('train_labels.csv') and run tr_labels again, you can see that the header 'Survived' is now in the 0th row rather than the table header. As a result, the tr_labels column vector is extended by one row (535 rows) relative to the tr_features matrix (534 rows), which creates the inconsistency. Let me know if this solution worked for you.
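
The behavior is easy to reproduce in isolation with a tiny in-memory CSV (a minimal sketch; the toy data below is made up and is not the course file):

import io
import pandas as pd

csv_text = "Survived\n0\n1\n1\n"  # toy labels file: one header row plus three data rows

inferred = pd.read_csv(io.StringIO(csv_text))                 # header inferred -> 3 data rows
no_header = pd.read_csv(io.StringIO(csv_text), header=None)   # 'Survived' becomes row 0 -> 4 rows

print(len(inferred), len(no_header))  # prints: 3 4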

Upvotes: 0
