Renato Borges

Reputation: 1093

How to find outliers in data with discrete variables in R

I'm beginning to learn R and data science in general.

I have a data frame and most of my variables and the class I want to predict are discrete.

What I need to do is find outliers in this data so I can deal with them by imputation or whatever.

The methods I researched include the IQR (interquartile range), Cook's distance, and the 'outliers' package, but most of them seem to apply only to continuous data. R gave me errors saying they could not be applied to factors, which I suppose is how my discrete data is stored.

This is one of the errors I got when using the 'outliers' package:

Error in Summary.factor(c(6L, 6L, 8L, 6L, 7L, 7L, 6L, 9L, 12L, 12L, 12L,  : 'max' not meaningful for factors

Am I doing something wrong here? Can someone help? Any help is appreciated, thanks.

Upvotes: 2

Views: 1204

Answers (1)

Shawn Hemelstrand

Reputation: 3228

Strictly speaking, there is no formal outlier detection method for categorical data, at least not in the form of an inferential test, and in many cases one wouldn't make sense. The best you can do is examine the data with descriptive methods and assess whether a category is rare. Even then, it may be a mistake to exclude rare values unless they were truly entered in error (for example, a Gender category containing stray zeroes). As an example, let's plot the marital status of respondents to the General Social Survey (GSS), using the gss_cat data from the forcats package:

#### Load Library ####
library(tidyverse)

#### Plot Categorical Data ####
gss_cat %>% 
  ggplot(aes(x=marital))+
  geom_bar(fill = "steelblue") 

The plot shows a very thin sliver of respondents in the No answer category:

[Image: bar chart of respondent counts by marital status]

If we inspect the counts of the data:

#### Inspect Counts ####
gss_cat %>% 
  select(marital) %>%
  group_by(marital) %>% 
  summarise(Count = n())

We find that 17 people didn't answer this question:

# A tibble: 6 × 2
  marital       Count
  <fct>         <int>
1 No answer        17
2 Never married  5416
3 Separated       743
4 Divorced       3383
5 Widowed        1807
6 Married       10117

You may think at this point that these 17 rows are useless and should be discarded. Remember, however, that these are people who answered the survey as a whole, so a missing answer here may indicate that they were uncomfortable providing this information. Let's also say, for argument's sake, that the Separated category had only 10 observations. Do we throw it out? We would be discarding very useful information if we did.
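If you want to make the "rare in some sense" check explicit, here is a minimal base R sketch that uses the counts from the table above. The 1% cutoff and the names `marital_counts`, `rare_cutoff`, and `rare_levels` are assumptions for illustration only. The result is a list of levels to review by hand, not a list of outliers to delete.

```r
# Counts copied from the gss_cat marital table shown above
marital_counts <- c(
  "No answer"     = 17,
  "Never married" = 5416,
  "Separated"     = 743,
  "Divorced"      = 3383,
  "Widowed"       = 1807,
  "Married"       = 10117
)

# Share of all respondents that falls in each level
marital_share <- marital_counts / sum(marital_counts)

# Hypothetical cutoff: flag levels holding less than 1% of the rows
rare_cutoff <- 0.01
rare_levels <- names(marital_share)[marital_share < rare_cutoff]

# Levels to inspect manually before deciding anything
rare_levels
```

With these counts, only the No answer level falls under the cutoff. Whether such a level is an error, a meaningful response, or just a small group still has to be judged from context.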

A good writeup on this subject can be found here.
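On the error quoted in the question: `max` fails because the column is stored as a factor. If the variable is really a count or an ordered numeric score that was read in as a factor (an assumption about your data, not something I can tell from the error alone), one option is to convert it back to numeric and then apply a rule such as the IQR. The sketch below uses made-up values, and `x`, `x_num`, and `flagged` are hypothetical names.

```r
# Hypothetical factor whose levels are numeric-looking labels
x <- factor(c("6", "6", "8", "6", "7", "7", "6", "9", "12", "12", "40"))

# as.numeric(x) alone would return the internal level codes, not the values,
# so convert to character first
x_num <- as.numeric(as.character(x))

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q <- unname(quantile(x_num, c(0.25, 0.75)))
fence <- 1.5 * IQR(x_num)
flagged <- x_num[x_num < q[1] - fence | x_num > q[2] + fence]

# Values that fall outside the fences
flagged
```

This only makes sense when the levels have a numeric meaning. For truly nominal categories such as marital status, the descriptive approach above is the appropriate one.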

Upvotes: 0
