Aayush Motiani

Reputation: 25

Cleaning Data in CSV file for ML Model

I'm trying to clean my data in JupyterLab by following several tutorials, but I keep running into one error or another. I'm hoping someone on Stack Overflow can help.

This is the csv file I want to clean: https://1drv.ms/u/s!AvOXB8kb-IHBgjaveis044GVoPpk

I'm building a machine learning model, so I want to convert all the object (string) columns to numeric values, but I don't know how.

EDIT: I tried cleaning the data from scratch.

My code input:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    criminal_data = pd.read_csv('database2.csv')
    X = criminal_data.drop(columns=['Agency Type', 'City', 'State', 
    'Crime Solved'])
    y = criminal_data['City']
    model = DecisionTreeClassifier()
    model.fit(X, y)
    criminal_data

The error message:


    ValueError                                Traceback (most recent call last)
    <ipython-input-117-4b6968f9994f> in <module>
          6 y = criminal_data['City']
          7 model = DecisionTreeClassifier()
    ----> 8 model.fit(X, y)
          9 criminal_data

    ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
        896         """
        897 
    --> 898         super().fit(
        899             X, y,
        900             sample_weight=sample_weight,

    ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
        154             check_X_params = dict(dtype=DTYPE, accept_sparse="csc")
        155             check_y_params = dict(ensure_2d=False, dtype=None)
    --> 156             X, y = self._validate_data(X, y,
        157                                        validate_separately=(check_X_params,
        158                                                             check_y_params))

    ~\anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
        428                 # :(
        429                 check_X_params, check_y_params = validate_separately
    --> 430                 X = check_array(X, **check_X_params)
        431                 y = check_array(y, **check_y_params)
        432             else:

    ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
         61             extra_args = len(args) - len(all_args)
         62             if extra_args <= 0:
    ---> 63                 return f(*args, **kwargs)
         64 
         65             # extra_args > 0

    ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
        614                     array = array.astype(dtype, casting="unsafe", copy=False)
        615                 else:
    --> 616                     array = np.asarray(array, order=order, dtype=dtype)
        617             except ComplexWarning as complex_warning:
        618                 raise ValueError("Complex data not supported\n"

    ~\anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order, like)
        100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
        101 
    --> 102     return array(a, dtype, copy=False, order=order)
        103 
        104 

    ~\anaconda3\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
       1897 
       1898     def __array__(self, dtype=None) -> np.ndarray:
    -> 1899         return np.asarray(self._values, dtype=dtype)
       1900 
       1901     def __array_wrap__(

    ~\anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order, like)
        100         return _asarray_with_like(a, dtype=dtype, order=order, like=like)
        101 
    --> 102     return array(a, dtype, copy=False, order=order)
        103 
        104 

    ValueError: could not convert string to float: 'Anchorage'

Upvotes: 0

Views: 294

Answers (1)

Muhteva

Reputation: 2832

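You are trying to train your model on data that is not numerical. The traceback shows `DecisionTreeClassifier` trying to convert every feature to float and failing on the string 'Anchorage', so you need to encode the string columns before fitting. You can try LabelEncoder for that: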

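    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    for column_name in X.columns:
        # Encode only the string (object) columns; numeric columns stay as they are.
        if X[column_name].dtype == object:
            # fit_transform refits the encoder, so each column gets its own mapping.
            X[column_name] = le.fit_transform(X[column_name])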

If a column contains a mix of data types (for example, strings alongside missing values), try the following instead:

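    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    for column_name in X.columns:
        # Replace missing values with a placeholder string.
        X[column_name] = X[column_name].fillna('none')
        # Cast to str so mixed types are handled uniformly. Note that this
        # also turns numeric columns into category codes.
        X[column_name] = le.fit_transform(X[column_name].astype(str))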
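
As a rough end-to-end sketch of the same idea, here is a version that runs on a tiny made-up DataFrame rather than your actual CSV. The column names and values are assumptions for illustration only. It uses `OrdinalEncoder` instead of `LabelEncoder`, since the scikit-learn documentation describes `LabelEncoder` as meant for the target `y` and `OrdinalEncoder` handles several feature columns in one call. Treat it as a starting point, not a finished cleaning pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real file; these columns and values are made up.
toy = pd.DataFrame({
    "Agency Name": ["Anchorage", "Juneau", "Anchorage", "Nome"],
    "Victim Age": [14, 43, 30, 27],
    "Weapon": ["Knife", None, "Handgun", "Knife"],
    "City": ["Anchorage", "Juneau", "Anchorage", "Nome"],
})

X = toy.drop(columns=["City"])
y = toy["City"]

# Select every non-numeric column; numeric columns are left untouched.
text_columns = X.select_dtypes(exclude="number").columns

# Fill missing values with a placeholder, then map each category to a number.
encoder = OrdinalEncoder()
X[text_columns] = encoder.fit_transform(
    X[text_columns].fillna("none").astype(str)
)

# With only numeric features left, fitting no longer raises the ValueError.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
print(model.predict(X))
```

Keep in mind that the integer codes imply an ordering between categories that does not really exist. A decision tree can usually live with that, but for other models one-hot encoding may be a better fit.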

Upvotes: 1
