jowens

Reputation: 383

Pandas: use fillna with a dataframe as value argument

I have a Pandas DataFrame with program, dataset, algorithm, and result fields, where result indicates the runtime of a program running on a particular algorithm and dataset. Some of the results are missing. I'd like to fill in those missing results with a result from the same dataset and algorithm for a reference program, Program-A.

I would be happy to take any suggestions for how I can improve my code. But my specific question is why I can't pass a DataFrame into the value argument of fillna but instead have to turn it into a dict. (The docs say value : scalar, dict, Series, or DataFrame.)

import numpy
import pandas

col = ['program', 'dataset', 'algorithm', 'result']
df = pandas.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', numpy.nan]
     ], columns=col)

df['algorithm_dataset'] = df['algorithm'] + "_" + df['dataset']

# build a dict from {algorithm+dataset} to result
dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
                                            'result']]
dfg = dfg.set_index('algorithm_dataset')
dfg_dict = dfg.to_dict()['result']

df = df.set_index('algorithm_dataset')
# df['result'] = df['result'].fillna(value=dfg)
# what's above doesn't work:
# ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
# so instead:
df['result'] = df['result'].fillna(value=dfg_dict)
df = df.reset_index()

print(df)

Versions:

$ port installed | grep pandas
  py27-pandas @0.19.1_0 (active)
$ python --version
Python 2.7.12

Upvotes: 1

Views: 1053

Answers (1)

jezrael

Reputation: 862791

You can pass a Series instead of a dict to fillna if you need to work with a single column (a Series):

ser = dfg.set_index('algorithm_dataset')['result']
print (ser)
algorithm_dataset
algorithm-i_dataset-X    1.0
algorithm-j_dataset-X    2.0
algorithm-i_dataset-Y    3.0
algorithm-j_dataset-Y    4.0
Name: result, dtype: float64

df = df.set_index('algorithm_dataset')
df['result1'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result  result1
algorithm_dataset                                                        
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0      1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0      2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0      3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0      4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     NaN      2.0

df['result'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result
algorithm_dataset                                               
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     2.0
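For anyone who wants to run the Series approach end to end, here is a minimal self-contained sketch, assuming a recent pandas on Python 3. It rebuilds the sample data from the question; the names `indexed` and `filled` are illustrative and do not come from the posts above.

```python
import numpy as np
import pandas as pd

# Sample data from the question: program-B is missing one result.
df = pd.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', np.nan]],
    columns=['program', 'dataset', 'algorithm', 'result'])
df['algorithm_dataset'] = df['algorithm'] + '_' + df['dataset']

# Reference results from program-A, keyed by algorithm_dataset.
ser = (df.loc[df['program'] == 'program-A']
         .set_index('algorithm_dataset')['result'])

# Index the full frame by the same key, then fill by label.
indexed = df.set_index('algorithm_dataset')
filled = indexed['result'].fillna(ser)
print(filled)
```

fillna matches `ser` to the column by index label, so the program-B row picks up program-A's value for the same key. This relies on the keys in `ser` being unique; if the reference program had two rows for the same algorithm and dataset, I would expect the alignment to fail.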

If you need to fill with a DataFrame, you first have to create another DataFrame with the same index and the same columns; then it works:

dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
                                            'result']]

dfg = dfg.set_index('algorithm_dataset')['result'].to_frame()
print (dfg)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0

df = df.set_index('algorithm_dataset')
df = df.drop(['program','dataset','algorithm'], axis=1)
print (df)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     NaN

dfg = dfg.reindex(df.index)
print (dfg)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     2.0

df = df.fillna(dfg)
print (df)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     2.0
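The same idea can be sketched without a duplicated index, assuming a recent pandas on Python 3. As far as I can tell, the DataFrame option for `value` applies when fillna is called on a DataFrame, not on a single column, which would be consistent with the ValueError in the question. Here the fill frame is built with `Series.map` so that it shares the default row index and the `result` column with the frame being filled; `result_only`, `fill`, and `filled` are illustrative names, not from the posts above.

```python
import numpy as np
import pandas as pd

# Sample data from the question: program-B is missing one result.
df = pd.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', np.nan]],
    columns=['program', 'dataset', 'algorithm', 'result'])
df['algorithm_dataset'] = df['algorithm'] + '_' + df['dataset']

# Reference results from program-A, keyed by algorithm_dataset.
ser = (df.loc[df['program'] == 'program-A']
         .set_index('algorithm_dataset')['result'])

# One-column DataFrame to fill, still on the default 0..4 row index.
result_only = df[['result']]

# Look up the reference value for every row, then turn it into a
# DataFrame with the same index and the same column name.
fill = df['algorithm_dataset'].map(ser).to_frame('result')

filled = result_only.fillna(fill)
print(filled)
```

Because both frames share the row index and the column name, each NaN is replaced by the value at the same position in `fill`. To write the values back into the original frame, `df['result'] = filled['result']` should be enough.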

Upvotes: 1
