Reputation: 924
I would like to remove outliers from a pandas dataframe using the standard deviation for a column variable after applying a groupby function.
Here is my data frame:
ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 NaN NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 NaN NaN
4 -2.090000 121.220000 -3.400000 NaN NaN
5 20.643483 17.069411 18.394178 Lael Brainard 0.215396
6 -2.090000 121.220000 -3.400000 NaN NaN
7 -2.090000 121.220000 -3.400000 NaN NaN
8 12.624198 52.220468 11.403157 Jerome H. Powell -1.350798
9 18.466305 35.186261 16.205693 Stanley Fischer 0.522121
10 -2.090000 121.220000 -3.400000 NaN NaN
11 16.953460 36.246573 15.323457 Lael Brainard -0.217779
12 -2.090000 121.220000 -3.400000 NaN NaN
13 -2.090000 121.220000 -3.400000 NaN NaN
14 17.066088 32.592551 16.108486 Stanley Fischer 0.642245
15 -2.090000 121.220000 -3.400000 NaN NaN
I would like to first group the dataframe by 'Speaker' and then remove 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as being more than 3 standard deviations from the mean of the scores for that specific feature.
Please let me know if this is possible. Thanks!
Upvotes: 0
Views: 2779
Reputation: 763
The only required dependency for this approach is pandas.
Suppose we have replaced the NaN values in the 'Speaker' column with something representative like 'CommitteeOrganization':
# Fill the missing speaker names with a placeholder label
dataset['Speaker'] = dataset['Speaker'].fillna(value='CommitteeOrganization')
So our data now looks like:
Index ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
Group the data by 'Speaker' and take the mean of each group with the pandas groupby function:
# Average the numeric columns per speaker
datasetGrouped = dataset.groupby(by='Speaker').mean()
So the grouped data looks like:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN
Janet Yellen 19.570911 27.362067 17.253580 -0.604757
Jerome H. Powell 12.624198 52.220468 11.403157 -1.350798
Lael Brainard 18.798471 26.657992 16.858818 -0.001191
Stanley Fischer 17.766196 33.889406 16.157089 0.582183
William Dudley 8.276460 64.478573 9.034156 1.670275
Compute the standard deviation of each column:
aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
Replace with NaN the values whose absolute value is greater than 3 times the column's standard deviation:
import numpy as np  # NumPy is installed together with pandas

datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = np.nan
The final dataset is:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 NaN -3.400000 NaN
Janet Yellen 19.570911 27.362067 17.253580 -0.604757
Jerome H. Powell 12.624198 52.220468 11.403157 -1.350798
Lael Brainard 18.798471 26.657992 16.858818 -0.001191
Stanley Fischer 17.766196 33.889406 16.157089 0.582183
William Dudley 8.276460 64.478573 9.034156 1.670275
Full code available on: Github
Note: This could be done in less code than presented, but the answer is laid out step by step for easier understanding; a compact sketch follows.
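For instance, assuming the dataframe is already loaded into the same dataset variable used above, a compact version of the same steps could look like this:
import numpy as np

# Same steps as above in fewer lines: fill the missing speakers,
# group by speaker, then mask values beyond 3 standard deviations
dataset['Speaker'] = dataset['Speaker'].fillna('CommitteeOrganization')
datasetGrouped = dataset.groupby('Speaker').mean()
for col in ['ARI', 'Flesch', 'Kincaid']:
    datasetGrouped.loc[datasetGrouped[col].abs() > datasetGrouped[col].std()*3, col] = np.nan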
Note 2: Because the question was a little ambiguous, if I misunderstood something and this isn't the right answer, don't hesitate to tell me and I'll update it if possible.
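For example, if the intent was instead to remove outliers from the original, ungrouped rows, i.e. blank out any 'ARI', 'Flesch' or 'Kincaid' value lying more than 3 standard deviations from that speaker's own mean (one possible reading of the question), a sketch using groupby/transform could be:
cols = ['ARI', 'Flesch', 'Kincaid']
grouped = dataset.groupby('Speaker')

# distance of each value from its speaker's mean, in units of that speaker's standard deviation
zscores = (dataset[cols] - grouped[cols].transform('mean')) / grouped[cols].transform('std')

# blank out values more than 3 standard deviations away from the group mean
cleaned = dataset.copy()
cleaned[cols] = dataset[cols].mask(zscores.abs() > 3)
Keep in mind that with only one or two rows per speaker, as in the sample above, the per-speaker standard deviation is undefined or tiny, so this is mainly useful on the full dataset.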
Upvotes: 2