Reputation: 1637
I have the following code snippet from a program called Flights.py
...
#Load the Dataset
df = dataset
df.isnull().any()
df = df.fillna(lambda x: x.median())
# Define X and Y
X = df.iloc[:, 2:124].values
y = df.iloc[:, 136].values
X_tolist = X.tolist()
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
The second to last line is throwing the following error:
Traceback (most recent call last):
File "<ipython-input-14-d4add2ccf5ab>", line 3, in <module>
X_train = sc.fit_transform(X_train)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/base.py", line 494, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 560, in fit
return self.partial_fit(X, y)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 583, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number, not 'function'
My dataframe df is of size (22587, 138).
I was taking a look at the following question for inspiration:
TypeError: float() argument must be a string or a number, not 'method' in Geocoder
I tried the following adjustment:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.as_matrix)
X_test = sc.transform(X_test.as_matrix)
Which resulted in the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'as_matrix'
I'm currently at a loss for how to scan through the dataframe and find and convert the offending entries.
Upvotes: 8
Views: 34615
Reputation: 1470
I had the same trouble using df = df.fillna(lambda x: x.median())
Here is my solution for getting actual values, rather than function objects, into the dataframe:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
Create a dataframe with 10 rows and 3 columns, then insert some NaN values:
df = pd.DataFrame(np.random.randint(100,size=(10,3)))
df.iloc[3:5,0] = np.nan
df.iloc[4:6,1] = np.nan
df.iloc[5:8,2] = np.nan
Assign descriptive column labels for later convenience:
df.columns=['Number_of_Holy_Hand_Grenades_of_Antioch', 'Number_of_knight_fleeings', 'Number_of_rabbits_of_Caerbannog']
print(df.isnull().any())  # reports, per column, whether it contains NaN
For each column, selected by its label, fill the NaN values with the median computed on that column. The same pattern works with mean(), etc.
for i in df.columns:  # use df.columns[w:] if the first w columns hold row descriptions
    df[i] = df[i].fillna(df[i].median())
print(df.isnull().any())
The NaN values in df have now been replaced by the column medians:
print(df)
You can then do, for example:
X = df.iloc[:, :].values  # .ix has been removed from pandas; .iloc selects by position
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
This does not work after df = df.fillna(lambda x: x.median()).
We can now pass df to downstream methods because every cell holds an actual value, not a function. That is not the case when a lambda is passed to DataFrame.fillna(), as in the proposals that combine fillna with a lambda.
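If a frame has already been filled with the lambda, the offending cells can be located by testing each cell with Python's built-in callable. The following is only a sketch under assumptions: the small frame and the names marker, is_callable, bad_columns, and cleaned are invented for illustration, and the lambda is placed into the cells by hand to mimic what the faulty fillna call leaves behind.

```python
import pandas as pd

# Hypothetical stand-in for the function object left in the NaN cells
marker = lambda x: x.median()

# Illustrative frame (assumed data); the lambda is inserted by hand here
df = pd.DataFrame({
    "col1": [65.0, 33.0, marker, 24.0],
    "col2": [24, 48, 34, 12],
    "col3": [47.0, marker, 67.0, 52.0],
})

# Boolean frame: True wherever a cell holds a callable instead of a number
is_callable = df.apply(lambda col: col.map(callable))
bad_columns = [c for c in df.columns if is_callable[c].any()]
print(bad_columns)

# Turn the offending cells back into NaN, restore numeric dtypes,
# then fill with the per-column medians
cleaned = df.mask(is_callable).apply(pd.to_numeric)
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```

Once the callables are gone and every column is numeric again, the frame can be handed to StandardScaler as in the question.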
Upvotes: 0
Reputation: 402363
As this answer explains, fillna isn't designed to work with a callback. If you pass one, it will be taken as the literal fill value, meaning your NaNs will be replaced with lambdas:
df
col1 col2 col3 col4
row1 65.0 24 47.0 NaN
row2 33.0 48 NaN 89.0
row3 NaN 34 67.0 NaN
row4 24.0 12 52.0 17.0
df.fillna(lambda x: x.median())
col1 col2 \
row1 65 24
row2 33 48
row3 <function <lambda> at 0x10bc47730> 34
row4 24 12
col3 col4
row1 47 <function <lambda> at 0x10bc47730>
row2 <function <lambda> at 0x10bc47730> 89
row3 67 <function <lambda> at 0x10bc47730>
row4 52 17
If you are trying to fill by median, the solution is to compute the per-column medians with df.median(), which returns a Series indexed by column name, and pass that to fillna.
df
col1 col2 col3 col4
row1 65.0 24 47.0 NaN
row2 33.0 48 NaN 89.0
row3 NaN 34 67.0 NaN
row4 24.0 12 52.0 17.0
df = df.fillna(df.median())  # fillna returns a new frame, so assign the result
df
col1 col2 col3 col4
row1 65.0 24 47.0 53.0
row2 33.0 48 52.0 89.0
row3 33.0 34 67.0 53.0
row4 24.0 12 52.0 17.0
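The median fill can be reproduced as a self-contained script. This is a minimal sketch that rebuilds the four-row frame shown in this answer; the variable names medians and filled are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from this answer
df = pd.DataFrame(
    {
        "col1": [65.0, 33.0, np.nan, 24.0],
        "col2": [24, 48, 34, 12],
        "col3": [47.0, np.nan, 67.0, 52.0],
        "col4": [np.nan, 89.0, np.nan, 17.0],
    },
    index=["row1", "row2", "row3", "row4"],
)

# One median per column; NaN values are skipped by default
medians = df.median()

# fillna matches the Series index against the column labels and
# returns a new frame, leaving df itself unchanged
filled = df.fillna(medians)
print(filled)
```

Because the medians are computed before the call, fillna only ever receives plain numbers, so the resulting columns stay numeric.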
Upvotes: 4
Reputation: 2520
df = df.fillna(lambda x: x.median())
This is not really a valid way of using fillna. It expects literal values here, or a mapping from column to literal values. It will not apply the function you've provided; instead, the value of NA cells will simply be set to the function itself. This is the function that your estimator is attempting to turn into a float.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
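As a rough illustration of the "mapping from column to literal values" form, the sketch below fills each column with a different value. The frame, the column names, and the chosen fill values are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumed data) with NaN in both columns
df = pd.DataFrame({
    "col1": [65.0, 33.0, np.nan, 24.0],
    "col4": [np.nan, 89.0, np.nan, 17.0],
})

# Mapping from column name to the literal value used for that column.
# The median is evaluated here, before fillna is called, so fillna
# receives a plain number rather than a function.
fill_values = {"col1": 0.0, "col4": df["col4"].median()}

filled = df.fillna(value=fill_values)
print(filled)
```

The key point is that any computation happens before the call, so only literal values ever reach fillna.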
Upvotes: 0