S. Mandal
S. Mandal

Reputation: 71

How to create a mask for a specific column in a dataframe in python?

I created a mask to replace detected outliers with NaN values in a specific column in a dataframe, and the code I wrote worked perfectly for the random dataframe I created, but the the same code did not work for the actual dataframe I am working on.

Here is the code using random dataframe:

import numpy as np

import pandas as pd
df =pd.DataFrame (np.random.randint(0,1000,size=(4,10)), columns=('A','B','C','D','E','F','G','H','I','J'))

df

lower= np.percentile(df['B'],25)
upper= np.percentile(df['B'],75)
outliers= [x for x in df['B'] if x < lower or x > upper]
print('Identified Outliers %d'% len(outliers))

mask= ((df['B']<lower)| (df['B']>upper))
df['B'][mask]=np.nan

Upvotes: 1

Views: 2867

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62543

  • I recommend using pandas.DataFrame.quantile.
    • With the default of axis=0, the specified quantile for each column is calculated.
    • By default, numeric_only=True, so only numeric values are considered, but if False is specified, this will work for datetime and timedelta data as well.
    • Columns that are not of numeric / dateime / timedelta type, will be ignored.
  • Use Pandas: Boolean Indexing to filter the dataframe along all the columns, at once.
  • To get the number of remaining numeric values, use filtered.count()
    • To find the number of NaN values, use df.count() - filtereed.count().
    • See pandas.DataFrame.count for parameter specifics.
  • In regards to the "real" dataframe, it's not possible to determine the issue, as it is not available.
    • Use df.info() to verify the Dtype of columns are a numeric type.
import pandas as pd
import numpy as np

# test data and dataframe
np.random.seed(50)
df = pd.DataFrame(np.random.randint(0, 10000, size=(4,10)) / 10, columns=('A','B','C','D','E','F','G','H','I','J'))
df['k'] = ['a', 'b', 'c', 'd']

# display(df)

       A      B      C      D      E      F      G      H      I      J  k
0  560.0  625.3  832.4  621.4  826.2  791.7  730.1  623.9  741.8  211.9  a
1  855.9  147.6  302.2   60.3  220.2  431.4  730.2  347.6  388.3  648.5  b
2  511.8   50.7  461.4  371.4  451.0  727.3  963.5  561.9   37.1  800.2  c
3   99.2  493.1  180.2  612.8  574.2  572.6  102.4  195.0  988.2  824.3  d


# calculate upper and lower quantiles
quantiles = df.quantile([.25, .75])

# display(quantiles)
            A        B       C        D      E      F        G       H      I        J
0.25  408.650  123.375  271.70  293.625  393.3  537.3  573.175  309.45  300.5  539.350
0.75  633.975  526.150  554.15  614.950  637.2  743.4  788.525  577.40  803.4  806.225

# filter the dataframe
filtered = df[(df < quantiles.loc[0.75]) & (df > quantiles.loc[0.25])]

# display(filtered)
       A      B      C      D      E      F      G      H      I      J    k
0  560.0    NaN    NaN    NaN    NaN    NaN  730.1    NaN  741.8    NaN  NaN
1    NaN  147.6  302.2    NaN    NaN    NaN  730.2  347.6  388.3  648.5  NaN
2  511.8    NaN  461.4  371.4  451.0  727.3    NaN  561.9    NaN  800.2  NaN
3    NaN  493.1    NaN  612.8  574.2  572.6    NaN    NaN    NaN    NaN  NaN

print(filtered.count())
[out]:
A    2
B    2
C    2
D    2
E    2
F    2
G    2
H    2
I    2
J    2
k    0
dtype: int64

Upvotes: 1

Gustav Rasmussen
Gustav Rasmussen

Reputation: 3971

Could be due to incompatible dtypes. Define a function to encapsulate the functionality, then run it for different dtype-columned dataframes, see below example:

import numpy as np
import pandas as pd


def mask_column(df):
    print(df)
    col_to_mask = df.columns.values[1]
    lower = np.percentile(df[col_to_mask], 25)
    upper = np.percentile(df[col_to_mask], 75)
    outliers = [x for x in df[col_to_mask] if x < lower or x > upper]
    print('Identified Outliers %d' % len(outliers))
    mask = ((df[col_to_mask] < lower) | (df[col_to_mask] > upper))
    df[col_to_mask][mask] = np.nan
    print(df)


df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                  columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
                  )

df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                  columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
                  )
df_2['B'] = df_1['B'].astype(float)

df_3 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                  columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')
                  )
df_3['B'] = df_1['B'].astype(str)

# mask_column(df_1)
# mask_column(df_2)
mask_column(df_3)

The first two function calls will succeed in applying the boolean mask, but not the third function call:

Traceback (most recent call last):
  File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 34, in <module>
    mask_column(df_3)
  File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 9, in mask_column
    lower = np.percentile(df[col_to_mask], 25)
  File "<__array_function__ internals>", line 5, in percentile
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3705, in percentile
    return _quantile_unchecked(
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3824, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3403, in _ureduce
    r = func(a, **kwargs)
  File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3941, in _quantile_ureduce_func
    x1 = take(ap, indices_below, axis=axis) * weights_below
TypeError: can't multiply sequence by non-int of type 'float'
     A    B    C    D    E    F    G    H    I    J
0  450  524  545  697   94  703   97  894  710  974
1  238  367   48  224  698  116  974  943  235  244
2  503  107  937  700  506  411  818  511  932  641
3  993  148  284  580  218  957  917   73   96  853

Upvotes: 1

Related Questions