Reputation: 3817
Consider the following dataframe:
       b           c     d     e  f     g     h
0   6.25  2018-04-01  True   NaN  7  54.0  64.0
1  32.50  2018-04-01  True   NaN  7  54.0  64.0
2  16.75  2018-04-01  True   NaN  7  54.0  64.0
3  29.25  2018-04-01  True   NaN  7  54.0  64.0
4  21.75  2018-04-01  True   NaN  7  54.0  64.0
5  21.75  2018-04-01  True  True  7  54.0  64.0
6   7.75  2018-04-01  True  True  7  54.0  64.0
7  23.25  2018-04-01  True  True  7  54.0  64.0
8  12.25  2018-04-01  True  True  7  54.0  64.0
9  30.50  2018-04-01  True   NaN  7  54.0  64.0
(Copy the table above and run df = pd.read_clipboard() to create the dataframe.)
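For readers who cannot use the clipboard, the same frame can be built explicitly. This is a sketch under assumptions: column c is kept as plain strings (as read_clipboard would leave it without date parsing), and column e is forced to object dtype to match the dtype reported further down.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "b": [6.25, 32.50, 16.75, 29.25, 21.75, 21.75, 7.75, 23.25, 12.25, 30.50],
    # Assumption: dates stay as plain strings, not datetime64
    "c": pd.Series(["2018-04-01"] * 10, dtype=object),
    "d": [True] * 10,
    # True mixed with NaN, held as object dtype as described in the question
    "e": pd.Series([np.nan] * 5 + [True] * 4 + [np.nan], dtype=object),
    "f": [7] * 10,
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

# Inspect the dtypes; column e should be reported as object
print(df.dtypes)
```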
Finding the medians initially works with no problem:
df.median()
b 21.75
d 1.00
e 1.00
f 7.00
g 54.00
h 64.00
dtype: float64
However, if a column is dropped and the median is then computed, the median for column e disappears:
new_df = df.drop(columns=['b'])
new_df.median()
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64
This behavior is unexpected, since finding the median for column e by itself still works:
new_df['e'].median()
1.0
Using skipna=False does not make a difference:
new_df.median(skipna=False)
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64
With the original dataframe, skipna=False does change the result:
df.median(skipna=False)
b 21.75
d 1.00
e NaN
f 7.00
g 54.00
h 64.00
dtype: float64
The datatype of column e is object in both df and new_df, and the only difference between the two dataframes is that new_df does not have column b. Adding the column back into new_df does not resolve the issue. This only occurs when the first column, b, is dropped. It does not occur if column e is a float or integer datatype.
This behavior is present in both pandas==0.22.0 and pandas==0.24.1.
There is now an open GitHub issue for anyone to try and solve this!
Upvotes: 17
Views: 1890
Reputation: 529
This appears to be a bug. Calling median on a dataframe maps to the internal _reduce function. With numeric_only set to None, it computes the median column by column, ignores failures (the median computation fails for column c, for example), and accumulates the results (see _reduce in the pandas source, core/frame.py). So far, so good. But while stitching the results together, it checks whether the results are scalars or series (for median they are scalars, of course). This check always uses the first column (see wrap_results in the pandas source, core/apply.py). So if the computation for the first column failed and that column was skipped, the check fails and raises an exception. This triggers the fallback within _reduce, which restricts the dataframe to numeric data only (dropping non-numeric columns, which here includes the object-dtype column e with its NaN values) and recomputes the medians.
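The control flow described above can be mimicked with a toy model in plain Python. This is not pandas code: toy_median, column_median and looks_numeric are hypothetical names, columns are plain lists, and the exception raised inside pandas is replaced by a simple branch. It only illustrates why a failing first column changes which columns survive.

```python
import math
from statistics import median

NAN = float("nan")

def column_median(values):
    """Median of one column, skipping NaN; text columns raise TypeError."""
    kept = []
    for value in values:
        if isinstance(value, str):
            raise TypeError("median is undefined for text")
        if isinstance(value, float) and math.isnan(value):
            continue
        kept.append(float(value))
    return median(kept)

def looks_numeric(values):
    """Crude stand-in for a dtype check: all bools, or all ints/floats."""
    kinds = {type(value) for value in values}
    return kinds == {bool} or kinds <= {int, float}

def toy_median(columns):
    """Toy model of the described flow; not the actual pandas implementation."""
    names = list(columns)
    results = {}
    for name in names:
        try:
            results[name] = column_median(columns[name])
        except TypeError:
            continue  # failures are ignored, the column is skipped
    # The wrap-up step only looks at the first column's result.
    if names[0] in results:
        return results
    # Fallback: keep only columns that pass the numeric check and recompute.
    return {n: column_median(columns[n]) for n in names if looks_numeric(columns[n])}

frame = {
    "b": [6.25, 32.5, 16.75],
    "c": ["2018-04-01"] * 3,
    "d": [True, True, True],
    "e": [NAN, True, True],   # bools mixed with NaN, like an object column
    "f": [7, 7, 7],
}
without_b = {name: col for name, col in frame.items() if name != "b"}

print(sorted(toy_median(frame)))      # column e is present
print(sorted(toy_median(without_b)))  # column e is gone
```

In the toy, column e survives the column-by-column pass but not the numeric-only fallback, which matches the symptom in the question.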
So in your case, if column c (or any other column whose dtype makes the median computation fail, such as text) comes first, then columns such as e, whose NaN values leave them with object dtype, are also dropped from the median results. Setting skipna does not change this, because the bug lies in how a non-numeric column in the first position triggers a forced numeric-only computation. I do not see any fix short of changing the pandas codebase, other than ensuring that the median computation always succeeds for the first column.
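Until that is fixed, one way to sidestep the frame-level reduction is to compute the medians one column at a time. The following is a minimal sketch, not an official pandas recipe: median_per_column is a hypothetical helper, and it assumes that casting a column to float is acceptable (so True counts as 1.0). The small new_df here is a cut-down stand-in for the frame in the question.

```python
import numpy as np
import pandas as pd

def median_per_column(frame):
    """Median of every column that can be cast to float; others are skipped."""
    medians = {}
    for name in frame.columns:
        try:
            medians[name] = frame[name].astype(float).median()
        except (TypeError, ValueError):
            continue  # e.g. text or date columns
    return pd.Series(medians, dtype=float)

# Cut-down stand-in for new_df: a text column first, then an object column with NaN
new_df = pd.DataFrame({
    "c": pd.Series(["2018-04-01"] * 4, dtype=object),
    "d": [True] * 4,
    "e": pd.Series([np.nan, True, True, np.nan], dtype=object),
    "f": [7] * 4,
})

result = median_per_column(new_df)
print(result)
```

Because each column is handled independently, the position of the text column no longer affects whether column e appears in the result.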
Upvotes: 3