lll
lll

Reputation: 1109

how to remove outliers in a dataframe based on a categorical variable in R

I have a data set of transactional data like the following:

 prodid  priceperitem       date
   62420         18.9  2014-10-09
   62420         29.9  2014-09-20
   62420         18.9  2014-10-11
   62420         27.9  2014-07-04
   62420         18.9  2014-08-25
   62420         18.9  2014-11-01

And I would like remove the outliers of prices for each product.

I have tried the following code, and it has replaced the prices that is a outlier with NA for each product. But this is a separate list and is grouped by prodid and I would want the the data variable to stay in the dataframe rather than having a new list.

remove.outliers <- tapply(priceperitem, prodid, function(x) {
     qnt <- quantile(x, probs=c(.25, .75))
      H <- 1.5 * IQR(x)
      y <- x
      y[x < (qnt[1] - H)] <- NA
      y[x > (qnt[2] + H)] <- NA
      y
})

And this code will give me some output like the following:

$205780229

 [1]   NA 10.9 10.5 10.9 10.9   NA ....

It is a new array but what I want is like the following:

     prodid       priceperitem    date
    205780229    NA              2014-10-03
    205780229    10.9            2014-10-20
    205780229    10.5            2014-10-30
    205780229    10.9            2014-5-23
    205780229    10.9            2014-11-20
....

Upvotes: 1

Views: 3298

Answers (2)

Jellen Vermeir
Jellen Vermeir

Reputation: 1720

You can use the by function in order to group the dataframe in smaller subsets and subsequently perform function calls on the individual subgroups. During these function calls you can easily remove the outliers from each of the subsets and return the results. Next, you can obtain the resulting dataframe by merging the subresults together.

I'll use the following dataframe to illustrate with an example:

prodid <- c(rep(62420,5),rep(62421,5))
pricePerItem <- c(18,18.1,23,17.9,18.0,51.7,22,51,52,52.2)
dates <- rep(Sys.time(),10)
products <- data.frame(prodid,pricePerItem,dates)
products
   prodid pricePerItem               dates
1   62420         18.0 2015-07-06 01:51:31
2   62420         18.1 2015-07-06 01:51:31
3   62420         23.0 2015-07-06 01:51:31
4   62420         17.9 2015-07-06 01:51:31
5   62420         18.0 2015-07-06 01:51:31
6   62421         51.7 2015-07-06 01:51:31
7   62421         22.0 2015-07-06 01:51:31
8   62421         51.0 2015-07-06 01:51:31
9   62421         52.0 2015-07-06 01:51:31
10  62421         52.2 2015-07-06 01:51:31

We group the dataframe by prodid and filter out the relevant outliers. We finish by merging the results:

   result <- by(products,products$prodid,function(product) {
      qnt <- quantile(product$pricePerItem, probs=c(.25, .75))
      H <- 1.5 * IQR(product$pricePerItem)
      outlierCheck <- (product$pricePerItem) > qnt[1]-H & (product$pricePerItem<qnt[2]+H)
      noOutliers <- product[outlierCheck,]
    })

    filteredFrame <- do.call("rbind",result)
    filteredFrame
             prodid pricePerItem               dates
    62420.1   62420         18.0 2015-07-06 01:51:31
    62420.2   62420         18.1 2015-07-06 01:51:31
    62420.4   62420         17.9 2015-07-06 01:51:31
    62420.5   62420         18.0 2015-07-06 01:51:31
    62421.6   62421         51.7 2015-07-06 01:51:31
    62421.8   62421         51.0 2015-07-06 01:51:31
    62421.9   62421         52.0 2015-07-06 01:51:31
    62421.10  62421         52.2 2015-07-06 01:51:31

SMALL EDIT I noticed that you want to replace the outliers with an NA value instead of removing them completely. You can obviously accomplish this behaviour in a similar manner. For example:

result <- by(products,products$prodid,function(product) {
  qnt <- quantile(product$pricePerItem, probs=c(.25, .75))
  H <- 1.5 * IQR(product$pricePerItem)

  outliers <- (product$pricePerItem) < qnt[1]-H | (product$pricePerItem > qnt[2]+H)
  product[outliers,2] <- NA
  product
})

filteredFrame <- do.call("rbind",result)
filteredFrame
         prodid pricePerItem               dates
62420.1   62420         18.0 2015-07-06 02:14:06
62420.2   62420         18.1 2015-07-06 02:14:06
62420.3   62420           NA 2015-07-06 02:14:06
62420.4   62420         17.9 2015-07-06 02:14:06
62420.5   62420         18.0 2015-07-06 02:14:06
62421.6   62421         51.7 2015-07-06 02:14:06
62421.7   62421           NA 2015-07-06 02:14:06
62421.8   62421         51.0 2015-07-06 02:14:06
62421.9   62421         52.0 2015-07-06 02:14:06
62421.10  62421         52.2 2015-07-06 02:14:06

Upvotes: 2

Ken Benoit
Ken Benoit

Reputation: 14912

The most straightforward way is to replace the index for your variable to be NAed using an index of logicals based on a function to detect the outlier. In the code below, I have augmented your original example dataset with two outliers, one low and one high.

myData <- read.table(text = "prodid  priceperitem       date
   62420         18.9  2014-10-09
   62420         29.9  2014-09-20
   62420         18.9  2014-10-11
   62420         27.9  2014-07-04
   62420         18.9  2014-08-25
   62420         18.9  2014-11-01
   62420         3     2014-11-01
   62420         50    2014-11-01", header=TRUE)

# function to return a logical for outlier status, from a numeric vector
identifyOutliers <-  function(x) {
    qnt <- quantile(x, probs=c(.25, .75))
    H <- 1.5 * IQR(x)
    outlier <- (x < (qnt[1] - H)) | (x > qnt[2] + H)
    outlier
}

# so:
which(identifyOutliers(myData$priceperitem))
## [1] 7 8

# copy the data to a new object
myDataNew <- myData
# replace the priceperitem with NA using the index
myDataNew$priceperitem[identifyOutliers(myData$priceperitem)] <- NA
myDataNew
##   prodid priceperitem       date
## 1  62420         18.9 2014-10-09
## 2  62420         29.9 2014-09-20
## 3  62420         18.9 2014-10-11
## 4  62420         27.9 2014-07-04
## 5  62420         18.9 2014-08-25
## 6  62420         18.9 2014-11-01
## 7  62420           NA 2014-11-01
## 8  62420           NA 2014-11-01

Upvotes: 1

Related Questions