Reputation: 2087
I have a pandas dataframe , where all missing values are np.nan, now I am trying to replace these missing values. The last column of my data is " class" , I need to group the data based on the class, then get mean/median/mode (based on data whether data is categorical/ continuos, normal/ not) of that group of a column and replace missing values of the group of the coulmn by respective mean/median/mode.
This is the code I have come up with , which I know is an overkill.. if I could :
it would be great.
but currently I landed up , finding replacement values (mean/median/mode) group wise and storing in dict, then seperating the nan tuples and non-nan tuples.. replacing missing values in nan tuples.. and trying to join them back to dataframe (which i donno yet how to do)
def fillMissing(df, dataType):
'''
Args:
df ( 2d array/ Dict):
eg : ('attribute1': [12, 24, 25] , 'attribute2': ['good', 'bad'])
dataTypes (dict): Dictionary of attribute names of df as keys and values 0/1
indicating categorical/continuous variable eg: ('attribute1':1, 'attribute2': 0)
Returns:
dataframe wih missing values filled
writes a file with missing values replaces.
'''
dataLabels = list(df.columns.values)
# the dictionary to hold the values to put in place of nan
replaceValues = {}
for eachlabel in dataLabels:
thisSer = df[eachlabel]
if dataType[eachlabel] == 1: # if its a continuous variable
_,pval = stats.normaltest(thisSer)
groupedd = thisSer.groupby(df['class'])
innerDict ={}
for name, group in groupedd:
if(pval < 0.5):
groupMiddle = group.median() # get the median of the group
else:
groupMiddle = group.mean() # get mean (if group is normal )
innerDict[name.strip()] = groupMiddle
replaceValues[eachlabel] = innerDict
else: # if the series is continuous
# freqCount = collections.Counter(thisSer)
groupedd = thisSer.groupby(df['class'])
innerDict ={}
for name, group in groupedd:
freqC = collections.Counter(group)
mostFreq = freqC.most_common(1) # get the most frequent value of the attribute(grouped by class)
# newGroup = group.replace(np.nan , mostFreq)
innerDict[name.strip()] = mostFreq[0][0].strip()
replaceValues[eachlabel] = innerDict
print replaceValues
# replace the missing values =======================
newfile = open('missingReplaced.csv', 'w')
newdf = df
mask=False
for col in df.columns: mask = mask | df[col].isnull()
# get the dataframe of tuples that contains nulls
dfnulls = df[mask]
dfnotNulls = df[~mask]
for _, row in dfnulls.iterrows():
for colname in dataLabels:
if pd.isnull(row[colname]):
if row['class'].strip() == '>50K':
row[colname] = replaceValues[colname]['>50K']
else:
row[colname] = replaceValues[colname]['<=50K']
newfile.write(str(row[colname]) + ",")
newdf.append(row)
newfile.write("\n")
# here add newdf to dfnotNulls to get finaldf
return finaldf
Upvotes: 3
Views: 3471
Reputation: 30444
If I understand correctly, this is mostly in the documentation, but probably not where you'd be looking if you're asking the question. See note regarding mode
at the bottom as it is slightly trickier than mean
and median
.
df = pd.DataFrame({ 'v':[1,2,2,np.nan,3,4,4,np.nan] }, index=[1,1,1,1,2,2,2,2],)
df['v_mean'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mean()))
df['v_med' ] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mode()[0]))
df
v v_mean v_med v_mode
1 1 1.000000 1 1
1 2 2.000000 2 2
1 2 2.000000 2 2
1 NaN 1.666667 2 2
2 3 3.000000 3 3
2 4 4.000000 4 4
2 4 4.000000 4 4
2 NaN 3.666667 4 4
Note that mode()
may not be unique, unlike mean
and median
and pandas returns it as a Series
for that reason. To deal with that, I just took the simplest route and added [0]
in order to extract the first member of the series.
Upvotes: 3