user10097913

Reputation:

How can I find value errors in a data set?

I am trying to scale data from a .csv table to a range between 0 and 1. Multiple times already I received the error that the input data contains NaN, infinity or a value too large.

"ValueError: Input contains NaN, infinity or a value too large for dtype('float64')."

Until now I was always able to figure out where the error came from, e.g. an empty cell, blanks in the table, or characters that were not UTF-8 compatible, and to fix it.

This time I received the error again, but I am not able to find its cause. Is there a way to find out which data point is "NaN, infinity or a value too large"? Since I have many data points, I cannot go through them manually. If you have a suggestion I would be very happy, even if it is just a trick in Excel to find the values causing the errors. Below you can find my code and the error. Unfortunately I cannot provide the data set, since it contains confidential information.
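(For illustration, here is a minimal sketch of one way to locate such values. It uses a small made-up frame in place of the confidential CSV; with the real data, `df` would come from `pd.read_csv`.)

```python
import numpy as np
import pandas as pd

# Stand-in for the confidential CSV; the real frame would come from read_csv
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, np.inf, 6.0]})

# Coerce every column to numeric; anything unparsable becomes NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# True wherever a cell is NaN or +/-inf
mask = ~np.isfinite(numeric)

# Report the exact row label, column name, and value of each offender
rows, cols = np.where(mask)
for r, c in zip(rows, cols):
    print(f"row {df.index[r]}, column '{df.columns[c]}': {df.iat[r, c]!r}")
```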

Code:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load training data set from CSV file
training_data_df = pd.read_csv("mtth_train.csv")

# Load testing data set from CSV file
test_data_df = pd.read_csv("mtth_test.csv")

# Data needs to be scaled to a small range like 0 to 1 
scaler = MinMaxScaler(feature_range=(0, 1))

# Scale both the training inputs and outputs
scaled_training = scaler.fit_transform(training_data_df)
scaled_testing = scaler.transform(test_data_df)

# Print out the adjustment that the scaler applied to the column at index 8
print("Note: Parameters were scaled by multiplying by {:.10f} and adding {:.6f}".format(scaler.scale_[8], scaler.min_[8]))

# Create new pandas DataFrame objects from the scaled data
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
scaled_testing_df = pd.DataFrame(scaled_testing, columns=test_data_df.columns.values)

# Save scaled data dataframes to new CSV files
scaled_training_df.to_csv("mtth_train_scaled.csv", index=False)
scaled_testing_df.to_csv("mtth_test_scaled.csv", index=False)

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-4e3503c96698> in <module>()
     14 # Scale both the training inputs and outputs
     15 scaled_training = scaler.fit_transform(training_data_df)
---> 16 scaled_testing = scaler.transform(test_data_df)
     17 
     18 # Print out the adjustment that the scaler applied to the total_earnings column of data

~/anaconda3_501/lib/python3.6/site-packages/sklearn/preprocessing/data.py in transform(self, X)
    365         check_is_fitted(self, 'scale_')
    366 
--> 367         X = check_array(X, copy=self.copy, dtype=FLOAT_DTYPES)
    368 
    369         X *= self.scale_

~/anaconda3_501/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
    454 
    455     shape_repr = _shape_repr(array.shape)

~/anaconda3_501/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Upvotes: 0

Views: 1966

Answers (2)

Bhuwan Bhatt

Reputation: 126

Use

df.isnull().sum()

to see the total number of missing values in each column.
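For example (with a toy frame standing in for the asker's data, which is not available), the per-column counts can be followed up by pulling out the actual offending rows:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

print(df.isnull().sum())             # missing values per column
print(df[df.isnull().any(axis=1)])   # the rows that contain them
```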

Upvotes: 0

mad_

Reputation: 8273

import numpy as np
indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
df[indices_to_keep]

In case you need to find how many values are NA or inf

from collections import Counter
Counter(indices_to_keep)
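A quick end-to-end run of the two snippets above (with a hypothetical four-row frame; in practice `df` is whatever frame you loaded):

```python
import numpy as np
import pandas as pd
from collections import Counter

df = pd.DataFrame({"a": [1.0, np.inf, np.nan, 4.0]})

# Keep only rows where no cell is NaN or +/-inf
indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)

print(df[indices_to_keep])       # only the finite rows survive
print(Counter(indices_to_keep))  # True = kept, False = dropped
```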

You can also follow the pandas docs on missing data: https://pandas.pydata.org/pandas-docs/stable/missing_data.html

As per the docs, we can set an option so that inf values are treated as NA:

pd.options.mode.use_inf_as_na = True

Then we can just look for NA values.

import pandas as pd
pd.isna(df)

Upvotes: 1
