Reputation: 1109
I have a data set of transactional data like the following:
prodid priceperitem date
62420 18.9 2014-10-09
62420 29.9 2014-09-20
62420 18.9 2014-10-11
62420 27.9 2014-07-04
62420 18.9 2014-08-25
62420 18.9 2014-11-01
And I would like remove the outliers of prices for each product.
I have tried the following code, and it has replaced the prices that is a outlier with NA for each product. But this is a separate list and is grouped by prodid and I would want the the data variable to stay in the dataframe rather than having a new list.
remove.outliers <- tapply(priceperitem, prodid, function(x) {
qnt <- quantile(x, probs=c(.25, .75))
H <- 1.5 * IQR(x)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
})
And this code will give me some output like the following:
$205780229
[1] NA 10.9 10.5 10.9 10.9 NA ....
It is a new array but what I want is like the following:
prodid priceperitem date
205780229 NA 2014-10-03
205780229 10.9 2014-10-20
205780229 10.5 2014-10-30
205780229 10.9 2014-5-23
205780229 10.9 2014-11-20
....
Upvotes: 1
Views: 3298
Reputation: 1720
You can use the by
function in order to group the dataframe in smaller subsets and subsequently perform function calls on the individual subgroups. During these function calls you can easily remove the outliers from each of the subsets and return the results. Next, you can obtain the resulting dataframe by merging the subresults together.
I'll use the following dataframe to illustrate with an example:
prodid <- c(rep(62420,5),rep(62421,5))
pricePerItem <- c(18,18.1,23,17.9,18.0,51.7,22,51,52,52.2)
dates <- rep(Sys.time(),10)
products <- data.frame(prodid,pricePerItem,dates)
products
prodid pricePerItem dates
1 62420 18.0 2015-07-06 01:51:31
2 62420 18.1 2015-07-06 01:51:31
3 62420 23.0 2015-07-06 01:51:31
4 62420 17.9 2015-07-06 01:51:31
5 62420 18.0 2015-07-06 01:51:31
6 62421 51.7 2015-07-06 01:51:31
7 62421 22.0 2015-07-06 01:51:31
8 62421 51.0 2015-07-06 01:51:31
9 62421 52.0 2015-07-06 01:51:31
10 62421 52.2 2015-07-06 01:51:31
We group the dataframe by prodid
and filter out the relevant outliers. We finish by merging the results:
result <- by(products,products$prodid,function(product) {
qnt <- quantile(product$pricePerItem, probs=c(.25, .75))
H <- 1.5 * IQR(product$pricePerItem)
outlierCheck <- (product$pricePerItem) > qnt[1]-H & (product$pricePerItem<qnt[2]+H)
noOutliers <- product[outlierCheck,]
})
filteredFrame <- do.call("rbind",result)
filteredFrame
prodid pricePerItem dates
62420.1 62420 18.0 2015-07-06 01:51:31
62420.2 62420 18.1 2015-07-06 01:51:31
62420.4 62420 17.9 2015-07-06 01:51:31
62420.5 62420 18.0 2015-07-06 01:51:31
62421.6 62421 51.7 2015-07-06 01:51:31
62421.8 62421 51.0 2015-07-06 01:51:31
62421.9 62421 52.0 2015-07-06 01:51:31
62421.10 62421 52.2 2015-07-06 01:51:31
SMALL EDIT
I noticed that you want to replace the outliers with an NA
value instead of removing them completely. You can obviously accomplish this behaviour in a similar manner. For example:
result <- by(products,products$prodid,function(product) {
qnt <- quantile(product$pricePerItem, probs=c(.25, .75))
H <- 1.5 * IQR(product$pricePerItem)
outliers <- (product$pricePerItem) < qnt[1]-H | (product$pricePerItem > qnt[2]+H)
product[outliers,2] <- NA
product
})
filteredFrame <- do.call("rbind",result)
filteredFrame
prodid pricePerItem dates
62420.1 62420 18.0 2015-07-06 02:14:06
62420.2 62420 18.1 2015-07-06 02:14:06
62420.3 62420 NA 2015-07-06 02:14:06
62420.4 62420 17.9 2015-07-06 02:14:06
62420.5 62420 18.0 2015-07-06 02:14:06
62421.6 62421 51.7 2015-07-06 02:14:06
62421.7 62421 NA 2015-07-06 02:14:06
62421.8 62421 51.0 2015-07-06 02:14:06
62421.9 62421 52.0 2015-07-06 02:14:06
62421.10 62421 52.2 2015-07-06 02:14:06
Upvotes: 2
Reputation: 14912
The most straightforward way is to replace the index for your variable to be NA
ed using an index of logicals based on a function to detect the outlier. In the code below, I have augmented your original example dataset with two outliers, one low and one high.
myData <- read.table(text = "prodid priceperitem date
62420 18.9 2014-10-09
62420 29.9 2014-09-20
62420 18.9 2014-10-11
62420 27.9 2014-07-04
62420 18.9 2014-08-25
62420 18.9 2014-11-01
62420 3 2014-11-01
62420 50 2014-11-01", header=TRUE)
# function to return a logical for outlier status, from a numeric vector
identifyOutliers <- function(x) {
qnt <- quantile(x, probs=c(.25, .75))
H <- 1.5 * IQR(x)
outlier <- (x < (qnt[1] - H)) | (x > qnt[2] + H)
outlier
}
# so:
which(identifyOutliers(myData$priceperitem))
## [1] 7 8
# copy the data to a new object
myDataNew <- myData
# replace the priceperitem with NA using the index
myDataNew$priceperitem[identifyOutliers(myData$priceperitem)] <- NA
myDataNew
## prodid priceperitem date
## 1 62420 18.9 2014-10-09
## 2 62420 29.9 2014-09-20
## 3 62420 18.9 2014-10-11
## 4 62420 27.9 2014-07-04
## 5 62420 18.9 2014-08-25
## 6 62420 18.9 2014-11-01
## 7 62420 NA 2014-11-01
## 8 62420 NA 2014-11-01
Upvotes: 1