Reputation: 1093
I'm beginning to learn R and data science in general.
I have a data frame and most of my variables and the class I want to predict are discrete.
What I need to do is find outliers in this data so I can deal with them by imputation or whatever.
Some methods I researched were the IQR (interquartile range), Cook's distance, and the 'outliers' package, but most of them seem to apply only to continuous data. R gave me errors saying they could not be applied to factors, which I suppose means discrete data in this case.
Here is one of the errors I got when using the 'outliers' package:
Error in Summary.factor(c(6L, 6L, 8L, 6L, 7L, 7L, 6L, 9L, 12L, 12L, 12L, : 'max' not meaningful for factors
Am I doing something wrong here? Can someone help? Any help is appreciated, thanks.
Upvotes: 2
Views: 1204
Reputation: 3228
In a formal sense, there is no such thing as an outlier detection method for categorical data, at least not an inferential test for one, and in many cases such a test wouldn't make sense. The best you can do is examine the data with descriptive methods and assess whether a category is rare in some sense. Even then, it may be a mistake to exclude it unless the value was truly entered in error (for example, a Gender category with stray zeroes entered). As an example, let's plot the marital status of respondents to the General Social Survey (GSS) using the gss_cat data from the forcats package:
#### Load Library ####
library(tidyverse)
#### Plot Categorical Data ####
gss_cat %>%
  ggplot(aes(x = marital)) +
  geom_bar(fill = "steelblue")
The plot shows a very thin sliver of respondents in the No answer category. We can inspect the counts directly:
#### Inspect Counts ####
gss_cat %>%
  select(marital) %>%
  group_by(marital) %>%
  summarise(Count = n())
We find 17 people didn't answer this question:
# A tibble: 6 × 2
  marital       Count
  <fct>         <int>
1 No answer        17
2 Never married  5416
3 Separated       743
4 Divorced       3383
5 Widowed        1807
6 Married       10117
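To put a number on "rare", one option is to turn the counts into shares of the total and flag anything below a cutoff. This is only a rough sketch in base R: the counts are copied from the output above, and the 1% cutoff (`rare_cutoff`) is an arbitrary assumption for illustration, not an established rule.

```r
# Counts copied from the summary of gss_cat$marital
counts <- c("No answer" = 17, "Never married" = 5416, "Separated" = 743,
            "Divorced" = 3383, "Widowed" = 1807, "Married" = 10117)

# Share of all respondents that falls in each category
shares <- counts / sum(counts)

# Hypothetical cutoff: treat levels with under 1% of observations as "rare"
rare_cutoff <- 0.01
rare_levels <- names(shares)[shares < rare_cutoff]

print(round(100 * shares, 2))  # percentages per category
print(rare_levels)             # levels worth a closer look, not automatic deletion
```

A flag like this only tells you where to look. Whether the level is an entry error or a real (if uncommon) answer is still a judgment call.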
You may think at this point that this data is useless and should be discarded. Remember, however, that these are people who answered the survey as a whole, so a missing answer here may indicate that they were uncomfortable providing this information. Let's also say, for argument's sake, that the Separated category had only 10 observations. Do we throw those out? Doing so would discard very useful information.
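As for the error in the question, it comes from calling `max` on a factor, which R refuses to do because unordered levels have no ranking. Below is a minimal sketch, assuming a hypothetical factor `x` whose labels happen to be numbers (the values are made up for illustration). Whether converting is appropriate depends on whether your variable is really a count or a true category.

```r
# Hypothetical discrete variable that was read in as a factor
x <- factor(c("6", "6", "8", "7", "12", "9"))

# Calling max() directly fails: 'max' not meaningful for factors
result <- tryCatch(max(x), error = function(e) conditionMessage(e))
print(result)

# If the labels really are numbers, convert via character first;
# as.numeric(x) alone would return the internal level codes instead
x_num <- as.numeric(as.character(x))
print(max(x_num))

# An ordered factor also supports max(), since its levels have a ranking
x_ord <- factor(x, levels = sort(unique(x_num)), ordered = TRUE)
print(max(x_ord))
```

If the variable is a genuine count, the numeric version can go through IQR-style checks. If it is a true category, the descriptive approach above is the safer route.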
A good writeup on this subject can be found here.
Upvotes: 0