JACK

Reputation: 484

When I convert my NumPy array to a DataFrame, all the values become NaN

import pandas as pd
import impyute.imputation.cs as imp

print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)

When I run the code above, all my values are converted to NaN, as shown below. Can someone help me find where I am going wrong?

Before

     Time  LymphNodeStatus    ...      MeanPerimeter  TumorSize
0      31              5.0    ...             117.50        5.0
1      61              2.0    ...             122.80        3.0
2     116              0.0    ...             137.50        2.5
3     123              0.0    ...              77.58        2.0
4      27              0.0    ...             135.10        3.5
5      77              0.0    ...              84.60        2.5

After

     Time  LymphNodeStatus    ...      MeanPerimeter  TumorSize
0     NaN              NaN    ...                NaN        NaN
1     NaN              NaN    ...                NaN        NaN
2     NaN              NaN    ...                NaN        NaN
3     NaN              NaN    ...                NaN        NaN
4     NaN              NaN    ...                NaN        NaN
5     NaN              NaN    ...                NaN        NaN

Upvotes: 2

Views: 3325

Answers (3)

Chris

Reputation: 29742

Edited

Solution first

Instead of passing columns to pd.DataFrame, just manually assign column names:

data = pd.DataFrame(imp.em(data))
data.columns = columns

Cause

The error lies in Data = pd.DataFrame(data = imp.em(Data),columns = columns).

imp.em has a decorator @preprocess which converts input into a numpy.array if it is a pandas.DataFrame.

...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
    args[0] = args[0].as_matrix()
    return pd_DataFrame(fn(*args, **kwargs))

It therefore returns a dataframe reconstructed from a matrix, having range(data.shape[1]) as column names.

As shown further down, when pd.DataFrame is instantiated from another pd.DataFrame with column names that do not match, all the contents become NaN.

You can test this with:

from impyute.util import preprocess

@preprocess
def test(data):
    return data

data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
columns = data.columns

data = pd.DataFrame(test(data), columns = columns)

   size  time
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN

When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which columns of the original dataframe you want to keep.

It does not relabel the dataframe. This is not odd; it is simply how pandas intends reindexing to work:

By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

# Make a new pseudo dataset
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
data
   size  time
0     3     1
1     2     2
2     1     3

# Make a new dataset from the original `data`
data = pd.DataFrame(data, columns = ["a", "b"])
data
     a    b
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
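
To make the contrast concrete, here is a minimal sketch that needs only pandas and NumPy (no impyute). The frame `original` and the labels `"a"` and `"b"` are made up for illustration. It compares passing a DataFrame with `columns`, passing the underlying array with `columns`, and assigning `.columns` afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the labels "a" and "b" are illustrative only.
original = pd.DataFrame({"time": [1, 2, 3], "size": [3, 2, 1]})
columns = ["a", "b"]

# Source is a DataFrame: `columns` selects by label. Neither "a" nor "b"
# exists in `original`, so every cell comes back as NaN.
selected = pd.DataFrame(original, columns=columns)

# Source is a bare ndarray: there are no labels to match against, so
# `columns` simply names the positions and the values survive.
relabelled = pd.DataFrame(original.to_numpy(), columns=columns)

# Assigning .columns afterwards relabels in place without any reindexing.
renamed = original.copy()
renamed.columns = columns

print(selected)
print(relabelled)
print(renamed)
```

Only the first call loses the values; the other two keep them, which is why assigning `data.columns = columns` after construction works.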

Upvotes: 4

JACK

Reputation: 484

Data = pd.DataFrame(data = np.array(imp.em(Data)),columns = columns)

Doing this solved the issue I was facing. I guess the em function doesn't return a NumPy array.
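
A likely reason this works, assuming (as the accepted answer describes) that imp.em handed back a DataFrame with default integer column labels: wrapping it in np.array drops the labels, so `columns` has nothing to match against and just names the positions. The sketch below uses a hand-built frame called `frame` as a stand-in for that return value; it does not call impyute.

```python
import numpy as np
import pandas as pd

# Stand-in for the imputed result: a DataFrame with default labels 0 and 1.
# (Assumption for illustration; this is not produced by impyute here.)
frame = pd.DataFrame([[31, 5.0], [61, 2.0]])
columns = ["Time", "TumorSize"]

# np.array() keeps the values but discards the row and column labels.
values = np.array(frame)

# With a plain ndarray as input, `columns` names the positions.
fixed = pd.DataFrame(values, columns=columns)
print(fixed)
```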

Upvotes: 0

Ankish Bansal

Reputation: 1902

There may be a bug in the impyute library. The em function you are using simply fills in missing values with the expectation-maximization algorithm. You can try building the DataFrame without that function:

df = pd.DataFrame(data = Data ,columns = columns)

You can raise this issue here after confirming. To confirm, first load the data as in the example above, then check whether it contains any null values using the df.isnull() method.
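
As a rough sketch of that check, with a small made-up array `raw` standing in for the real data, the boolean mask from df.isnull() can be summed per column or collapsed to a single yes/no answer:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value; replace with the real array.
raw = np.array([[31, 5.0], [61, np.nan]])
df = pd.DataFrame(raw, columns=["Time", "TumorSize"])

# Count of missing entries in each column.
missing_per_column = df.isnull().sum()

# True if any cell in the whole frame is missing.
any_missing = df.isnull().values.any()

print(missing_per_column)
print(any_missing)
```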

Upvotes: 0
