Reputation: 71
I created a mask to replace detected outliers with NaN values in a specific column in a dataframe, and the code I wrote worked perfectly for the random dataframe I created, but the the same code did not work for the actual dataframe I am working on.
Here is the code using random dataframe:
import numpy as np
import pandas as pd
df =pd.DataFrame (np.random.randint(0,1000,size=(4,10)), columns=('A','B','C','D','E','F','G','H','I','J'))
df
lower= np.percentile(df['B'],25)
upper= np.percentile(df['B'],75)
outliers= [x for x in df['B'] if x < lower or x > upper]
print('Identified Outliers %d'% len(outliers))
mask= ((df['B']<lower)| (df['B']>upper))
df['B'][mask]=np.nan
Upvotes: 1
Views: 2867
Reputation: 62543
pandas.DataFrame.quantile
.
axis=0
, the specified quantile for each column is calculated.numeric_only=True
, so only numeric values are considered, but if False
is specified, this will work for datetime
and timedelta
data as well.numeric
/ dateime
/ timedelta
type, will be ignored.filtered.count()
NaN
values, use df.count() - filtereed.count()
.pandas.DataFrame.count
for parameter specifics.df.info()
to verify the Dtype
of columns are a numeric type.import pandas as pd
import numpy as np
# test data and dataframe
np.random.seed(50)
df = pd.DataFrame(np.random.randint(0, 10000, size=(4,10)) / 10, columns=('A','B','C','D','E','F','G','H','I','J'))
df['k'] = ['a', 'b', 'c', 'd']
# display(df)
A B C D E F G H I J k
0 560.0 625.3 832.4 621.4 826.2 791.7 730.1 623.9 741.8 211.9 a
1 855.9 147.6 302.2 60.3 220.2 431.4 730.2 347.6 388.3 648.5 b
2 511.8 50.7 461.4 371.4 451.0 727.3 963.5 561.9 37.1 800.2 c
3 99.2 493.1 180.2 612.8 574.2 572.6 102.4 195.0 988.2 824.3 d
# calculate upper and lower quantiles
quantiles = df.quantile([.25, .75])
# display(quantiles)
A B C D E F G H I J
0.25 408.650 123.375 271.70 293.625 393.3 537.3 573.175 309.45 300.5 539.350
0.75 633.975 526.150 554.15 614.950 637.2 743.4 788.525 577.40 803.4 806.225
# filter the dataframe
filtered = df[(df < quantiles.loc[0.75]) & (df > quantiles.loc[0.25])]
# display(filtered)
A B C D E F G H I J k
0 560.0 NaN NaN NaN NaN NaN 730.1 NaN 741.8 NaN NaN
1 NaN 147.6 302.2 NaN NaN NaN 730.2 347.6 388.3 648.5 NaN
2 511.8 NaN 461.4 371.4 451.0 727.3 NaN 561.9 NaN 800.2 NaN
3 NaN 493.1 NaN 612.8 574.2 572.6 NaN NaN NaN NaN NaN
print(filtered.count())
[out]:
A 2
B 2
C 2
D 2
E 2
F 2
G 2
H 2
I 2
J 2
k 0
dtype: int64
Upvotes: 1
Reputation: 3971
Could be due to incompatible dtypes. Define a function to encapsulate the functionality, then run it for different dtype-columned dataframes, see below example:
import numpy as np
import pandas as pd
def mask_column(df):
print(df)
col_to_mask = df.columns.values[1]
lower = np.percentile(df[col_to_mask], 25)
upper = np.percentile(df[col_to_mask], 75)
outliers = [x for x in df[col_to_mask] if x < lower or x > upper]
print('Identified Outliers %d' % len(outliers))
mask = ((df[col_to_mask] < lower) | (df[col_to_mask] > upper))
df[col_to_mask][mask] = np.nan
print(df)
df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
)
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
)
df_2['B'] = df_1['B'].astype(float)
df_3 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
)
df_3['B'] = df_1['B'].astype(str)
# mask_column(df_1)
# mask_column(df_2)
mask_column(df_3)
The first two function calls will succeed in applying the boolean mask, but not the third function call:
Traceback (most recent call last):
File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 34, in <module>
mask_column(df_3)
File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 9, in mask_column
lower = np.percentile(df[col_to_mask], 25)
File "<__array_function__ internals>", line 5, in percentile
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3705, in percentile
return _quantile_unchecked(
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3824, in _quantile_unchecked
r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3403, in _ureduce
r = func(a, **kwargs)
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3941, in _quantile_ureduce_func
x1 = take(ap, indices_below, axis=axis) * weights_below
TypeError: can't multiply sequence by non-int of type 'float'
A B C D E F G H I J
0 450 524 545 697 94 703 97 894 710 974
1 238 367 48 224 698 116 974 943 235 244
2 503 107 937 700 506 411 818 511 932 641
3 993 148 284 580 218 957 917 73 96 853
Upvotes: 1