Reputation: 71

How to create a mask for a specific column in a dataframe in python?

I created a mask to replace detected outliers with NaN values in a specific column in a dataframe, and the code I wrote worked perfectly for the random dataframe I created, but the the same code did not work for the actual dataframe I am working on.

Here is the code using random dataframe:

import numpy as np

import pandas as pd
df =pd.DataFrame (np.random.randint(0,1000,size=(4,10)), columns=('A','B','C','D','E','F','G','H','I','J'))

df

lower= np.percentile(df['B'],25)
upper= np.percentile(df['B'],75)
outliers= [x for x in df['B'] if x < lower or x > upper]
print('Identified Outliers %d'% len(outliers))

mask= ((df['B']<lower)| (df['B']>upper))
df['B'][mask]=np.nan

The code above worked perfectly for this dataframe, Number of identified outliers and number of replaced values to NaN are equal.
Surprisingly, same code did not work for the actual dataframe, though it identified the numbers of outliers, but did not replace the outliers to NaN values.
Is there any particular reason for it? is there anything needs to be done with datatype of that column of the the actual dataframe?

Upvotes: 1

Answers (2)

Trenton McKinney

Reputation: 62543

I recommend using pandas.DataFrame.quantile.
- With the default of axis=0, the specified quantile for each column is calculated.
- By default, numeric_only=True, so only numeric values are considered, but if False is specified, this will work for datetime and timedelta data as well.
- Columns that are not of numeric / dateime / timedelta type, will be ignored.
Use Pandas: Boolean Indexing to filter the dataframe along all the columns, at once.
To get the number of remaining numeric values, use filtered.count()
- To find the number of NaN values, use df.count() - filtereed.count().
- See pandas.DataFrame.count for parameter specifics.
In regards to the "real" dataframe, it's not possible to determine the issue, as it is not available.
- Use df.info() to verify the Dtype of columns are a numeric type.

import pandas as pd
import numpy as np

# test data and dataframe
np.random.seed(50)
df = pd.DataFrame(np.random.randint(0, 10000, size=(4,10)) / 10, columns=('A','B','C','D','E','F','G','H','I','J'))
df['k'] = ['a', 'b', 'c', 'd']

# display(df)

       A      B      C      D      E      F      G      H      I      J  k
0  560.0  625.3  832.4  621.4  826.2  791.7  730.1  623.9  741.8  211.9  a
1  855.9  147.6  302.2   60.3  220.2  431.4  730.2  347.6  388.3  648.5  b
2  511.8   50.7  461.4  371.4  451.0  727.3  963.5  561.9   37.1  800.2  c
3   99.2  493.1  180.2  612.8  574.2  572.6  102.4  195.0  988.2  824.3  d


# calculate upper and lower quantiles
quantiles = df.quantile([.25, .75])

# display(quantiles)
            A        B       C        D      E      F        G       H      I        J
0.25  408.650  123.375  271.70  293.625  393.3  537.3  573.175  309.45  300.5  539.350
0.75  633.975  526.150  554.15  614.950  637.2  743.4  788.525  577.40  803.4  806.225

# filter the dataframe
filtered = df[(df < quantiles.loc[0.75]) & (df > quantiles.loc[0.25])]

# display(filtered)
       A      B      C      D      E      F      G      H      I      J    k
0  560.0    NaN    NaN    NaN    NaN    NaN  730.1    NaN  741.8    NaN  NaN
1    NaN  147.6  302.2    NaN    NaN    NaN  730.2  347.6  388.3  648.5  NaN
2  511.8    NaN  461.4  371.4  451.0  727.3    NaN  561.9    NaN  800.2  NaN
3    NaN  493.1    NaN  612.8  574.2  572.6    NaN    NaN    NaN    NaN  NaN

print(filtered.count())
[out]:
A    2
B    2
C    2
D    2
E    2
F    2
G    2
H    2
I    2
J    2
k    0
dtype: int64

Upvotes: 1

Gustav Rasmussen

Reputation: 3971

Could be due to incompatible dtypes. Define a function to encapsulate the functionality, then run it for different dtype-columned dataframes, see below example:

import numpy as np
import pandas as pd


def mask_column(df):
    print(df)
    col_to_mask = df.columns.values[1]
    lower = np.percentile(df[col_to_mask], 25)
    upper = np.percentile(df[col_to_mask], 75)
    outliers = [x for x in df[col_to_mask] if x < lower or x > upper]
    print('Identified Outliers %d' % len(outliers))
    mask = ((df[col_to_mask] < lower) | (df[col_to_mask] > upper))
    df[col_to_mask][mask] = np.nan
    print(df)


df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                  columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
                  )

df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                  columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
                  )
df_2['B'] = df_1['B'].astype(float)

df_3 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                  columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
                  )
df_3['B'] = df_1['B'].astype(str)

# mask_column(df_1)
# mask_column(df_2)
mask_column(df_3)

The first two function calls will succeed in applying the boolean mask, but not the third function call:

Traceback (most recent call last):
  File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 34, in <module>
    mask_column(df_3)
  File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 9, in mask_column
    lower = np.percentile(df[col_to_mask], 25)
  File "<__array_function__ internals>", line 5, in percentile
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3705, in percentile
    return _quantile_unchecked(
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3824, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3403, in _ureduce
    r = func(a, **kwargs)
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3941, in _quantile_ureduce_func
    x1 = take(ap, indices_below, axis=axis) * weights_below
TypeError: can't multiply sequence by non-int of type 'float'
     A    B    C    D    E    F    G    H    I    J
0  450  524  545  697   94  703   97  894  710  974
1  238  367   48  224  698  116  974  943  235  244
2  503  107  937  700  506  411  818  511  932  641
3  993  148  284  580  218  957  917   73   96  853

Upvotes: 1

How to create a mask for a specific column in a dataframe in python?

Answers (2)

Related Questions