how to remove outliers in a dataframe based on a categorical variable in R

Question

I have a data set of transactional data like the following:

 prodid  priceperitem       date
   62420         18.9  2014-10-09
   62420         29.9  2014-09-20
   62420         18.9  2014-10-11
   62420         27.9  2014-07-04
   62420         18.9  2014-08-25
   62420         18.9  2014-11-01

And I would like remove the outliers of prices for each product.

I have tried the following code, and it has replaced the prices that is a outlier with NA for each product. But this is a separate list and is grouped by prodid and I would want the the data variable to stay in the dataframe rather than having a new list.

remove.outliers <- tapply(priceperitem, prodid, function(x) {
     qnt <- quantile(x, probs=c(.25, .75))
      H <- 1.5 * IQR(x)
      y <- x
      y[x < (qnt[1] - H)] <- NA
      y[x > (qnt[2] + H)] <- NA
      y
})

And this code will give me some output like the following:

$205780229

 [1]   NA 10.9 10.5 10.9 10.9   NA ....

It is a new array but what I want is like the following:

     prodid       priceperitem    date
    205780229    NA              2014-10-03
    205780229    10.9            2014-10-20
    205780229    10.5            2014-10-30
    205780229    10.9            2014-5-23
    205780229    10.9            2014-11-20
....

Jellen Vermeir · Accepted Answer

You can use the by function in order to group the dataframe in smaller subsets and subsequently perform function calls on the individual subgroups. During these function calls you can easily remove the outliers from each of the subsets and return the results. Next, you can obtain the resulting dataframe by merging the subresults together.

I'll use the following dataframe to illustrate with an example:

prodid <- c(rep(62420,5),rep(62421,5))
pricePerItem <- c(18,18.1,23,17.9,18.0,51.7,22,51,52,52.2)
dates <- rep(Sys.time(),10)
products <- data.frame(prodid,pricePerItem,dates)
products
   prodid pricePerItem               dates
1   62420         18.0 2015-07-06 01:51:31
2   62420         18.1 2015-07-06 01:51:31
3   62420         23.0 2015-07-06 01:51:31
4   62420         17.9 2015-07-06 01:51:31
5   62420         18.0 2015-07-06 01:51:31
6   62421         51.7 2015-07-06 01:51:31
7   62421         22.0 2015-07-06 01:51:31
8   62421         51.0 2015-07-06 01:51:31
9   62421         52.0 2015-07-06 01:51:31
10  62421         52.2 2015-07-06 01:51:31

We group the dataframe by prodid and filter out the relevant outliers. We finish by merging the results:

   result <- by(products,products$prodid,function(product) {
      qnt <- quantile(product$pricePerItem, probs=c(.25, .75))
      H <- 1.5 * IQR(product$pricePerItem)
      outlierCheck <- (product$pricePerItem) > qnt[1]-H & (product$pricePerItem



SMALL EDIT
I noticed that you want to replace the outliers with an NA value instead of removing them completely. You can obviously accomplish this behaviour in a similar manner. For example:

result <- by(products,products$prodid,function(product) {
  qnt <- quantile(product$pricePerItem, probs=c(.25, .75))
  H <- 1.5 * IQR(product$pricePerItem)

  outliers <- (product$pricePerItem) < qnt[1]-H | (product$pricePerItem > qnt[2]+H)
  product[outliers,2] <- NA
  product
})

filteredFrame <- do.call("rbind",result)
filteredFrame
         prodid pricePerItem               dates
62420.1   62420         18.0 2015-07-06 02:14:06
62420.2   62420         18.1 2015-07-06 02:14:06
62420.3   62420           NA 2015-07-06 02:14:06
62420.4   62420         17.9 2015-07-06 02:14:06
62420.5   62420         18.0 2015-07-06 02:14:06
62421.6   62421         51.7 2015-07-06 02:14:06
62421.7   62421           NA 2015-07-06 02:14:06
62421.8   62421         51.0 2015-07-06 02:14:06
62421.9   62421         52.0 2015-07-06 02:14:06
62421.10  62421         52.2 2015-07-06 02:14:06

how to remove outliers in a dataframe based on a categorical variable in R

Answers (2)

Related Questions