I have a Pandas DataFrame with program, dataset, algorithm, and result fields, where result indicates the runtime of a program running on a particular algorithm and dataset. Some of the results are missing. I'd like to fill in those missing results with a result from the same dataset and algorithm for a reference program, program-A.
I would be happy to take any suggestions for how I can improve my code. But my specific question is why I can't pass a DataFrame into the value argument of fillna but instead have to turn it into a dict. (The docs say value : scalar, dict, Series, or DataFrame.)
import numpy
import pandas

col = ['program', 'dataset', 'algorithm', 'result']
df = pandas.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', numpy.nan]
    ], columns=col)
df['algorithm_dataset'] = df['algorithm'] + "_" + df['dataset']
# build a dict from {algorithm+dataset} to result
dfg = df.loc[df['program'] == 'program-A', ['algorithm_dataset', 'result']]
dfg = dfg.set_index('algorithm_dataset')
dfg_dict = dfg.to_dict()['result']
df = df.set_index('algorithm_dataset')
# df['result'] = df['result'].fillna(value=dfg)
# what's above doesn't work:
# ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
# so instead:
df['result'] = df['result'].fillna(value=dfg_dict)
df = df.reset_index()
print(df)
Versions:
$ port installed | grep pandas
py27-pandas @0.19.1_0 (active)
$ python --version
Python 2.7.12
Upvotes: 1
Views: 1053
Reputation: 862791
You can pass a Series instead of a dict to fillna when you are filling a single column (which is itself a Series):
# dfg here is the program-A slice with 'algorithm_dataset' still a column
ser = dfg.set_index('algorithm_dataset')['result']
print (ser)
algorithm_dataset
algorithm-i_dataset-X 1.0
algorithm-j_dataset-X 2.0
algorithm-i_dataset-Y 3.0
algorithm-j_dataset-Y 4.0
Name: result, dtype: float64
df = df.set_index('algorithm_dataset')
df['result1'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result  result1
algorithm_dataset
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0      1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0      2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0      3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0      4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     NaN      2.0
df['result'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result
algorithm_dataset
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     2.0
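For reference, the Series approach above can be put together as one self-contained sketch. It assumes a recent pandas on Python 3 rather than the 0.19.1 / 2.7 setup from the question, and the names ref, indexed, filled, rejected and out are only illustrative. It also tries the DataFrame fill value on the column to show that it is refused there; the exact exception type seems to depend on the pandas version, so both are caught.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', np.nan]],
    columns=['program', 'dataset', 'algorithm', 'result'])
df['algorithm_dataset'] = df['algorithm'] + '_' + df['dataset']

# Reference results from program-A, keyed by algorithm_dataset (illustrative name).
ref = (df.loc[df['program'] == 'program-A']
         .set_index('algorithm_dataset')['result'])

indexed = df.set_index('algorithm_dataset')

# A Series fill value is aligned to the column's index and accepted.
filled = indexed['result'].fillna(ref)

# A DataFrame fill value is refused when the target is a single column.
try:
    indexed['result'].fillna(ref.to_frame())
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Write the filled values back positionally and restore the default index.
out = indexed.assign(result=filled.values).reset_index()
print(out)
print(rejected)
```

The reference Series has a unique index, which is what lets it be matched against the duplicated algorithm-j_dataset-X label in the target column.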
If you need to call fillna with a DataFrame, you first have to create another DataFrame with the same index and the same columns; then it works:
dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
'result']]
dfg = dfg.set_index('algorithm_dataset')['result'].to_frame()
print (dfg)
result
algorithm_dataset
algorithm-i_dataset-X 1.0
algorithm-j_dataset-X 2.0
algorithm-i_dataset-Y 3.0
algorithm-j_dataset-Y 4.0
df = df.set_index('algorithm_dataset')
df = df.drop(['program','dataset','algorithm'], axis=1)
print (df)
result
algorithm_dataset
algorithm-i_dataset-X 1.0
algorithm-j_dataset-X 2.0
algorithm-i_dataset-Y 3.0
algorithm-j_dataset-Y 4.0
algorithm-j_dataset-X NaN
dfg = dfg.reindex(df.index)
print (dfg)
result
algorithm_dataset
algorithm-i_dataset-X 1.0
algorithm-j_dataset-X 2.0
algorithm-i_dataset-Y 3.0
algorithm-j_dataset-Y 4.0
algorithm-j_dataset-X 2.0
df = df.fillna(dfg)
print (df)
result
algorithm_dataset
algorithm-i_dataset-X 1.0
algorithm-j_dataset-X 2.0
algorithm-i_dataset-Y 3.0
algorithm-j_dataset-Y 4.0
algorithm-j_dataset-X 2.0
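The DataFrame variant can be sketched the same way, starting again from the question's data. This is a minimal version assuming a recent pandas on Python 3; ref_frame, target, aligned and filled_frame are illustrative names, not anything from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', np.nan]],
    columns=['program', 'dataset', 'algorithm', 'result'])
df['algorithm_dataset'] = df['algorithm'] + '_' + df['dataset']

# One-column reference frame built from program-A (illustrative name).
ref_frame = (df.loc[df['program'] == 'program-A',
                    ['algorithm_dataset', 'result']]
               .set_index('algorithm_dataset'))

# Target frame reduced to the same single column.
target = df.set_index('algorithm_dataset')[['result']]

# Give the reference frame exactly the target's index, duplicates included.
aligned = ref_frame.reindex(target.index)

# Both frames now share index and columns, so fillna matches cell by cell.
filled_frame = target.fillna(aligned)
print(filled_frame)
```

The reindex step is what makes the two frames line up; without it the reference frame has four rows and the target has five.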