user4045430

Reputation: 217

Python Pandas Dataframe fill NaN values

I am trying to fill NaN values in a dataframe with values coming from a standard normal distribution. This is currently my code:

 sqlStatement = "select * from sn.clustering_normalized_dataset"
 df = psql.frame_query(sqlStatement, cnx)
 data=df.pivot("user","phrase","tfw")
 dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
 data[np.isnan(data)] = dfrand[np.isnan(data)]

After pivoting, the dataframe 'data' looks like this:

phrase      aaron  abbas  abdul       abe  able  abroad       abu     abuse  \
user                                                                          
14233664      NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
52602716      NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
123456789     NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
500158258     NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
517187571     0.4    NaN    NaN  0.142857     1     0.4  0.181818       NaN  

However, I need each NaN value to be replaced with a new random value. So I created a new df consisting of only random values (dfrand) and then tried to swap the missing numbers (NaN) for the values from dfrand at the corresponding indices. Unfortunately it doesn't work. Although the expression

 np.isnan(data)

returns a dataframe consisting of True and False values, the expression

  dfrand[np.isnan(data)]

returns only NaN values, so the overall trick doesn't work. Any ideas what the issue is?

Upvotes: 6

Views: 5047

Answers (2)

tnknepp

Reputation: 6263

Three thousand columns is not so many. How many rows do you have? You could always make a random dataframe of the same size and do a logical replacement (the size of your dataframe will dictate whether this is feasible or not).

If you know the size of your dataframe:

import pandas as pd
import numpy as np

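# create random dummy dataframe (rows and cols are the known dimensions)
dfrand = pd.DataFrame(data=np.random.randn(rows, cols))

# import "real" dataframe
data = pd.read_csv(...)  # or however you choose to read it in

# replace nans (the mask is matched by label, so both frames need the same index and columns)
data[np.isnan(data)] = dfrand[np.isnan(data)]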

If you do not know the size of your dataframe, just shuffle things around:

import pandas as pd
import numpy as np



# import "real" dataframe
data = pd.read_csv(...)  # or however you choose to read it in

# create random dummy dataframe with the same shape and labels as data,
# so that the boolean mask below lines up with it
dfrand = pd.DataFrame(data=np.random.randn(*data.shape), index=data.index, columns=data.columns)

# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]

EDIT: Per the user's last comment: "dfrand[np.isnan(data)] returns NaN only."

Right! And that is exactly what you wanted. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly-generated value from dfrand that corresponds to the NaN-location within "data" and insert it in "data" where "data" is NaN. An example will help:

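a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
a[0] = a[0].astype(float)  # NaN can only be stored in a float column
a.loc[5, 0] = np.nan       # row label 5, column label 0; .loc avoids chained assignment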

In [32]: a
Out[33]: 
    0   1   2
0   2  26  28
1  14  79  82
2  89  32  59
3  65  47  31
4  29  59  15
5 NaN  58  90
6  15  66  60
7  10  19  96
8  90  26  92
9   0  19  23

# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))

In [39]: b
Out[39]: 
    0   1   2
0  92  21  55
1  65  53  89
2  54  98  97
3  48  87  79
4  98  38  62
5  46  16  30
6  95  39  70
7  90  59   9
8  14  85  37
9  48  29  46


a[np.isnan(a)] = b[np.isnan(a)]

In [38]: a
Out[38]: 
    0   1   2
0   2  26  28
1  14  79  82
2  89  32  59
3  65  47  31
4  29  59  15
5  46  58  90
6  15  66  60
7  10  19  96
8  90  26  92
9   0  19  23

As you can see, all NaN's in a have been replaced with the randomly-generated value in b based on a's NaN-value indices.
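
One caveat worth checking, sketched here under assumptions: a boolean-DataFrame lookup such as b[np.isnan(a)] is matched on row and column labels, not on position. The example above works because a and b share the same default integer labels. For a pivoted frame labelled by user and phrase, a random frame built with default labels may not line up, which would be consistent with the all-NaN result reported in the question. The toy frame and the names unaligned, aligned and filled are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pivoted frame (hypothetical values and labels).
data = pd.DataFrame(
    [[np.nan, 0.4, np.nan], [1.0, np.nan, 0.2]],
    index=pd.Index([14233664, 517187571], name="user"),
    columns=pd.Index(["aaron", "abe", "able"], name="phrase"),
)
mask = data.isna()

# Default labels (0, 1, ...) share nothing with data's labels,
# so the masked lookup selects nothing and comes back all NaN.
unaligned = pd.DataFrame(np.random.randn(*data.shape))
all_nan = unaligned[mask].isna().all().all()

# Reusing data's index and columns lets the mask select real values.
aligned = pd.DataFrame(
    np.random.randn(*data.shape), index=data.index, columns=data.columns
)
filled = data.copy()
filled[mask] = aligned[mask]  # only the NaN cells are overwritten
```

If all_nan comes out True on your own frames, passing index=data.index and columns=data.columns when building the random frame is probably the missing piece.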

Upvotes: 5

acushner

Reputation: 9946

You could try something like this, assuming you are dealing with one series:

ser = data['column_with_nulls_to_replace']
index = ser[ser.isnull()].index
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['column_with_nulls_to_replace'])
ser.update(df)
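
A variant of the same idea that stays within Series, as a rough sketch: draw one random value per missing entry, index those values by the missing positions, and let fillna match them up by label. The toy frame and the names missing_idx and fill_values are illustrative assumptions, not part of the code above.

```python
import numpy as np
import pandas as pd

# Toy frame with one column containing NaNs (hypothetical data).
data = pd.DataFrame({"column_with_nulls_to_replace": [0.4, np.nan, 1.0, np.nan]})

ser = data["column_with_nulls_to_replace"]
missing_idx = ser.index[ser.isna()]  # labels of the missing entries

# One standard-normal draw per missing entry, labelled by where it belongs.
fill_values = pd.Series(np.random.randn(len(missing_idx)), index=missing_idx)

# fillna with a Series fills by matching index labels; assign the result back.
data["column_with_nulls_to_replace"] = ser.fillna(fill_values)
```

Assigning the result back to the column avoids relying on in-place modification of a column pulled out of the frame.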

Upvotes: 0
