Reputation: 608
I am working with a data-frame in R. I have the following function which removes all rows of a data-frame df
where, for a specified column index/attribute, the value at that row is outside mean (of column) plus or minus n*stdev (of column).
remove_outliers <- function(df,attr,n){
outliersgone <- df[df[,attr]<=(mean(df[,attr],na.rm=TRUE)+n*sd(df[,attr],na.rm=TRUE)) & df[,attr]>=(mean(df[,attr],na.rm=TRUE)-n*sd(df[,attr],na.rm=TRUE)),]
return(outliersgone)
}
There are two parts to my question.
(1) My data-frame df
also has a column 'Group', which specifies a class label. I would like to be able to remove outliers according to mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So you would remove from the data-frame a row labelled with group A if, in the specified column/attribute, the value at that row is outside mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.
How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group)
followed by mutate
but I'm not sure what to pass to mutate, given my function remove_outliers
seems to require the whole data-frame to be passed into it (so it can return the whole data-frame with rows only removed based on the chosen attribute attr
).
I am open to hearing suggestions for changing the function remove_outliers
as well, as long as they also return the whole data-frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).
(2) Is there a straightforward way I could combine outlier considerations across multiple columns? e.g. remove from the dataframe df
those rows which are outliers wrt at least $N$ attributes out of a specified vector of attributes/column indices (length≥N). or a more complex condition like, remove from the dataframe df
those rows which are outliers wrt Attribute 1 and at least 2 of Attributes 2,4,6,8.
(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)
Upvotes: 1
Views: 906
Reputation: 872
Ok - part 1 (and trying to avoid loops wherever possible):
Here's some test data:
test_data=data.frame(
group=c(rep("a",100),rep("b",100)),
value=rnorm(200)
)
We'll find the groups:
groups=levels(test_data[,1]) # or unique(test_data[,1]) if it isn't a factor
And we'll calculate the outlier limits (here I'm specifying only 1 sd) - sorry for the loop, but it's only over the groups, not the data:
outlier_sds=1
outlier_limits=sapply(groups,function(g) {
m=mean(test_data[test_data[,1]==g,2])
s=sd(test_data[test_data[,1]==g,2])
return(c(m-outlier_sds*s,m+outlier_sds*s))
})
So we can define the limits for each row of test_data
:
test_data_limits=outlier_limits[,test_data[,1]]
And use this to determine the outliers:
outliers=test_data[,2]<test_data_limits[1,] | test_data[,2]>test_data_limits[2,]
(or, combining those last steps):
outliers=test_data[,2]<outlier_limits[1,test_data[,1]] | test_data[,2]>outlier_limits[2,test_data[,1]]
Finally:
test_data_without_outliers=test_data[!outliers,]
EDIT: now part 2 (apply part 1 with a loop over all the columns in the data):
Some test data with more than one column of values:
test_data2=data.frame(
group=c(rep("a",100),rep("b",100)),
value1=rnorm(200),
value2=2*rnorm(200),
value3=3*rnorm(200)
)
Combine all the steps of part 1 into a new function find_outliers
that returns a logical vector indicating whether any value is an outlier for its respective column & group:
find_outliers = function(values,n_sds,groups) {
group_names=levels(groups)
outlier_limits=sapply(group_names,function(g) {
m=mean(values[groups==g])
s=sd(values[groups==g])
return(c(m-n_sds*s,m+n_sds*s))
})
return(values < outlier_limits[1,groups] | values > outlier_limits[2,groups])
}
And then apply this function to each of the data columns:
test_groups=test_data2[,1]
test_data_outliers=apply(test_data2[,-1],2,function(d) find_outliers(values=d,n_sds=1,groups=test_groups))
The rowSums
of test_data_outliers
indicate how many times each row is considered an 'outlier' in the various columns, with respect to its own group:
rowSums(test_data_outliers)
Upvotes: 1