Reputation: 484
import impyute.imputation.cs as imp
print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)
When I run the code above, all my values are converted to NaN, as shown below. Can someone help me see where I am going wrong?
Before
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 31 5.0 ... 117.50 5.0
1 61 2.0 ... 122.80 3.0
2 116 0.0 ... 137.50 2.5
3 123 0.0 ... 77.58 2.0
4 27 0.0 ... 135.10 3.5
5 77 0.0 ... 84.60 2.5
After
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
Upvotes: 2
Views: 3325
Reputation: 29742
Edited
Solution first
Instead of passing columns to pd.DataFrame, just assign the column names manually:
data = pd.DataFrame(imp.em(data))
data.columns = columns
Cause
The error lies in Data = pd.DataFrame(data = imp.em(Data),columns = columns).
imp.em has a decorator, @preprocess, which converts the input into a numpy.array if it is a pandas.DataFrame.
...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
    args[0] = args[0].as_matrix()
    return pd_DataFrame(fn(*args, **kwargs))
It therefore returns a dataframe reconstructed from a matrix, with range(data.shape[1]) as its column names.
And as I point out below, when pd.DataFrame is instantiated from another pd.DataFrame with mismatching columns, all the contents become NaN.
You can test this with:
import pandas as pd
from impyute.util import preprocess

@preprocess
def test(data):
    return data

data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
columns = data.columns
data = pd.DataFrame(test(data), columns = columns)
size time
0 NaN NaN
1 NaN NaN
2 NaN NaN
When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which of the columns from the original dataframe you want to use.
It does not re-label the dataframe. This is not odd; it is just how pandas intends reindexing to work:
by default, values in the new index that do not have corresponding records in the dataframe are assigned NaN.
# Make new pseudo dataset
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
data
size time
0 3 1
1 2 2
2 1 3
# Make a new dataset from the original `data`, asking for columns it does not have
data = pd.DataFrame(data, columns = ["a", "b"])
data
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
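To make the difference between selecting and relabelling concrete, here is a minimal sketch using only pandas. The frame and the names partial, renamed and relabelled are made up for illustration; nothing here depends on impyute.

```python
import pandas as pd

data = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})

# columns= on an existing DataFrame selects by label: "time" is found,
# "b" is not, so "b" is created and filled with NaN.
partial = pd.DataFrame(data, columns=["time", "b"])

# To rename instead, map old labels to new ones...
renamed = data.rename(columns={"time": "a", "size": "b"})

# ...or overwrite the labels positionally on a copy.
relabelled = data.copy()
relabelled.columns = ["a", "b"]

print(partial)
print(renamed)
print(relabelled)
```

In partial, the "time" column keeps its values while "b" is all NaN, which is the same mechanism that blanks out the whole frame when none of the requested labels exist.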
Upvotes: 4
Reputation: 484
Data = pd.DataFrame(data = np.array(imp.em(Data)),columns = columns)
Doing this solved the issue I was facing. I guess the value returned by the em function is not a NumPy array.
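A rough sketch of why the np.array conversion helps, assuming (as the other answer describes) that the result comes back as a DataFrame with integer column labels. The result frame below is a hypothetical stand-in for the output of em, not real impyute output.

```python
import numpy as np
import pandas as pd

columns = ["time", "size"]

# Hypothetical stand-in for the imputation result: a DataFrame labelled 0, 1.
result = pd.DataFrame([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])

# np.array() drops the labels, so the constructor has nothing to align on
# and simply attaches `columns` to the values.
fixed = pd.DataFrame(data=np.array(result), columns=columns)

# Without the conversion, the constructor reindexes by label and finds no match.
broken = pd.DataFrame(data=result, columns=columns)

print(fixed)
print(broken)
```

Here fixed keeps the values under the new names, while broken contains only NaN.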
Upvotes: 0
Reputation: 1902
There may be a bug in the impyute library. You are using the em function, which fills in missing values using the expectation-maximization algorithm. You can try it without that function:
df = pd.DataFrame(data = Data ,columns = columns)
You can raise this issue here after confirming. To confirm, first load the data as in the example above, then check whether any null values are present using the df.isnull() method.
Upvotes: 0