Reputation: 924
I would like to remove outliers from a pandas dataframe using the standard deviation for a column variable after applying a groupby function.
Here is my data frame:
ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 NaN NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 NaN NaN
4 -2.090000 121.220000 -3.400000 NaN NaN
5 20.643483 17.069411 18.394178 Lael Brainard 0.215396
6 -2.090000 121.220000 -3.400000 NaN NaN
7 -2.090000 121.220000 -3.400000 NaN NaN
8 12.624198 52.220468 11.403157 Jerome H. Powell -1.350798
9 18.466305 35.186261 16.205693 Stanley Fischer 0.522121
10 -2.090000 121.220000 -3.400000 NaN NaN
11 16.953460 36.246573 15.323457 Lael Brainard -0.217779
12 -2.090000 121.220000 -3.400000 NaN NaN
13 -2.090000 121.220000 -3.400000 NaN NaN
14 17.066088 32.592551 16.108486 Stanley Fischer 0.642245
15 -2.090000 121.220000 -3.400000 NaN NaN
I would like to first group the dataframe by 'Speaker' and then remove 'ARI', 'Flesch', and 'Kincaid' values that are outliers, defined as being more than 3 standard deviations from the mean of the scores for that specific feature.
Please let me know if this is possible. Thanks!
Upvotes: 0
Views: 2779
Reputation: 763
The only required dependency for this approach is pandas.
Suppose we have replaced the NaN values in the 'Speaker' column with something representative like 'CommitteeOrganization':
# Fill the missing speaker names with a placeholder label
dataset['Speaker'] = dataset['Speaker'].fillna(value='CommitteeOrganization')
So our data now looks like:
Index ARI Flesch Kincaid Speaker Score
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
1 8.276460 64.478573 9.034156 William Dudley 1.670275
2 19.570911 27.362067 17.253580 Janet Yellen -0.604757
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN
Group the data by 'Speaker' and take the mean of each group with the pandas groupby function:
# Average the numeric columns per speaker
datasetGrouped = dataset.groupby(by='Speaker').mean()
So the grouped data looks like:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN
Janet Yellen 19.570911 27.362067 17.253580 -0.604757
Jerome H. Powell 12.624198 52.220468 11.403157 -1.350798
Lael Brainard 18.798471 26.657992 16.858818 -0.001191
Stanley Fischer 17.766196 33.889406 16.157089 0.582183
William Dudley 8.276460 64.478573 9.034156 1.670275
Compute the standard deviation of each column:
aristd = datasetGrouped['ARI'].std()
fleschstd = datasetGrouped['Flesch'].std()
kincaidstd = datasetGrouped['Kincaid'].std()
Replace with NaN the values whose absolute value is greater than 3 times the column's standard deviation:
import numpy as np  # NumPy is installed together with pandas

datasetGrouped.loc[abs(datasetGrouped.ARI) > aristd*3, 'ARI'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Flesch) > fleschstd*3, 'Flesch'] = np.nan
datasetGrouped.loc[abs(datasetGrouped.Kincaid) > kincaidstd*3, 'Kincaid'] = np.nan
The final dataset is:
Speaker ARI Flesch Kincaid Score
CommitteeOrganization -2.090000 NaN -3.400000 NaN
Janet Yellen 19.570911 27.362067 17.253580 -0.604757
Jerome H. Powell 12.624198 52.220468 11.403157 -1.350798
Lael Brainard 18.798471 26.657992 16.858818 -0.001191
Stanley Fischer 17.766196 33.889406 16.157089 0.582183
William Dudley 8.276460 64.478573 9.034156 1.670275
Full code available on: Github
Note: This could be done in less code than presented, but the answer is laid out step by step for easier understanding; a compact sketch follows.
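For instance, assuming the dataframe is already loaded into the same dataset variable used above, a compact version of the same steps could look like this:
import numpy as np

# Same steps as above in fewer lines: fill the missing speakers,
# group by speaker, then mask values beyond 3 standard deviations
dataset['Speaker'] = dataset['Speaker'].fillna('CommitteeOrganization')
datasetGrouped = dataset.groupby('Speaker').mean()
for col in ['ARI', 'Flesch', 'Kincaid']:
    datasetGrouped.loc[datasetGrouped[col].abs() > datasetGrouped[col].std()*3, col] = np.nan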
Note 2: Because the question was a little ambiguous, if I misunderstood something and this isn't the right answer, don't hesitate to tell me and I'll update it if possible.
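For example, if the intent was instead to remove outliers from the original, ungrouped rows, i.e. blank out any 'ARI', 'Flesch' or 'Kincaid' value lying more than 3 standard deviations from that speaker's own mean (one possible reading of the question), a sketch using groupby/transform could be:
cols = ['ARI', 'Flesch', 'Kincaid']
grouped = dataset.groupby('Speaker')

# distance of each value from its speaker's mean, in units of that speaker's standard deviation
zscores = (dataset[cols] - grouped[cols].transform('mean')) / grouped[cols].transform('std')

# blank out values more than 3 standard deviations away from the group mean
cleaned = dataset.copy()
cleaned[cols] = dataset[cols].mask(zscores.abs() > 3)
Keep in mind that with only one or two rows per speaker, as in the sample above, the per-speaker standard deviation is undefined or tiny, so this is mainly useful on the full dataset.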
Upvotes: 2