Remove outlier rows by column and factor in R

Question

I am working with a data-frame in R. I have the following function which removes all rows of a data-frame df where, for a specified column index/attribute, the value at that row is outside mean (of column) plus or minus n*stdev (of column).

remove_outliers <- function(df,attr,n){
  outliersgone <- df[df[,attr]<=(mean(df[,attr],na.rm=TRUE)+n*sd(df[,attr],na.rm=TRUE)) & df[,attr]>=(mean(df[,attr],na.rm=TRUE)-n*sd(df[,attr],na.rm=TRUE)),]
  return(outliersgone)
}

There are two parts to my question.

(1) My data-frame df also has a column 'Group', which specifies a class label. I would like to be able to remove outliers according to mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So you would remove from the data-frame a row labelled with group A if, in the specified column/attribute, the value at that row is outside mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.

How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group) followed by mutate but I'm not sure what to pass to mutate, given my function remove_outliers seems to require the whole data-frame to be passed into it (so it can return the whole data-frame with rows only removed based on the chosen attribute attr).

I am open to hearing suggestions for changing the function remove_outliers as well, as long as they also return the whole data-frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).

(2) Is there a straightforward way I could combine outlier considerations across multiple columns? e.g. remove from the dataframe df those rows which are outliers wrt at least $N$ attributes out of a specified vector of attributes/column indices (length≥N). or a more complex condition like, remove from the dataframe df those rows which are outliers wrt Attribute 1 and at least 2 of Attributes 2,4,6,8.

(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)

Remove outlier rows by column and factor in R

Answers (1)

Related Questions