Reputation: 53
I'm trying to implement the AVF (Attribute Value Frequency) algorithm for outlier detection in categorical data (A. Koufakou, E.G. Ortiz, et al., http://www.enriquegortiz.com/publications/outlierDetection_ictai07.pdf) in R. My data set has about 500,000 rows and 13 variables (all categorical).
The pseudocode is pretty simple:
# Label all data points as non-outliers
# calculate frequency of each attribute value
# foreach point x
# AVFscore = Sum(frequency each attrib. value in x)/num.attribs
# end foreach
# return top k outliers with minimum AVFscore
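Written as a formula, the score in the pseudocode is the mean frequency of a point's attribute values. The symbols are notation assumed here for illustration: m is the number of attributes, x_{il} is the value of attribute l in point x_i, and f(x_{il}) is the number of rows that share that value in attribute l.

```latex
\mathrm{AVFscore}(x_i) = \frac{1}{m} \sum_{l=1}^{m} f(x_{il})
```

Points made up of rare values get a low score, which is why the k lowest scores are reported as outliers.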
To get the frequency of each attribute value, I used
freq_matrix <- apply(mydata, MARGIN = 2, FUN = count) # from plyr
which gave me a list of data frames, one per variable, each holding the frequency of that variable's values. So far so good.
I'm stuck on how to iterate over each row and compute the 'AVFscore'. I'm certain I need to use apply(), but I can't wrap my head around how it should work. For each row, I need to look up the frequency of each value in my freq_matrix, sum them, and divide by the number of variables (i.e. 13).
Example:
Country_cde Flag1 Flag2 Score
IE A X 9/13
IE B X 7/13
US A X 8/13
US A Y 6/13
IE C Z 5/13
So I know that for Country_cde, IE's frequency is 3 and US's is 2. For Flag1, A is 3, B is 1, C is 1, and so on. In this example, the final row has the lowest score, so it would be a likely outlier.
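The counts quoted above can be checked with one frequency table per column. This is only a minimal sketch on the toy rows from the example; the name `freqs` is made up for illustration and the Score column is left out.

```r
# Toy data copied from the example (Score column omitted).
mydata <- data.frame(
  Country_cde = c("IE", "IE", "US", "US", "IE"),
  Flag1       = c("A", "B", "A", "A", "C"),
  Flag2       = c("X", "X", "X", "Y", "Z"),
  stringsAsFactors = FALSE
)

# One frequency table per column.
freqs <- lapply(mydata, table)

ie_count <- as.integer(freqs$Country_cde["IE"])  # rows with Country_cde == "IE"
a_count  <- as.integer(freqs$Flag1["A"])         # rows with Flag1 == "A"

# Summed frequencies for the last row: IE + C + Z.
last_row_sum <- as.integer(freqs$Country_cde["IE"]) +
  as.integer(freqs$Flag1["C"]) +
  as.integer(freqs$Flag2["Z"])
```

With three columns the last row sums to 5, matching the numerator shown in the example table.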
Upvotes: 5
Views: 3172
Reputation: 15784
Base R approach:
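mydata <- read.table(text="Country_cde Flag1 Flag2
IE A X
IE B X
US A X
US A Y
IE C Z", header=TRUE, stringsAsFactors=FALSE)
# Count the occurrences of every value, pooled across all columns.
# Caveat: a value that appears in more than one column gets a single combined count.
freq_matrix <- table(unlist(unname(mydata)))
n_cols <- ncol(mydata) # Number of columns, computed once outside the per-row function
# For each row, sum the frequencies of its values and paste the sum with the number of columns.
mydata[,"Score"] <- apply(mydata, 1, function(x) paste0(sum(freq_matrix[x]), "/", n_cols))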
Output:
> mydata
Country_cde Flag1 Flag2 Score
1 IE A X 9/3
2 IE B X 7/3
3 US A X 8/3
4 US A Y 6/3
5 IE C Z 5/3
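If you want the numeric value of the division instead, drop paste0. Use this in place of the previous assignment, on mydata before it has a Score column; otherwise the score strings are looked up as well and the result is wrong: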
mydata[,"Score"] <- apply(mydata,1,function(x) { sum(freq_matrix[x]) / length(x) })
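The pooled table above works for the example because no value occurs in more than one column. If the same label could appear in several columns (say "A" in both Flag1 and Flag2), counting per column is safer. The following is a sketch under that assumption, not a tested solution for the full 500,000-row data: `avf_scores` is a hypothetical helper name, and the data is assumed to contain no missing values. It also avoids a row-wise apply(), which may help on large inputs.

```r
# Toy data from the question.
mydata <- data.frame(
  Country_cde = c("IE", "IE", "US", "US", "IE"),
  Flag1       = c("A", "B", "A", "A", "C"),
  Flag2       = c("X", "X", "X", "Y", "Z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean per-column frequency of each row's values.
avf_scores <- function(df) {
  # For each column, replace every value by its count within that column.
  counts <- lapply(df, function(col) {
    col <- as.character(col)
    as.numeric(table(col)[col])
  })
  # Add the count vectors element-wise, then divide by the number of columns.
  Reduce(`+`, counts) / length(counts)
}

scores <- avf_scores(mydata)

# Row indices of the k lowest scores, i.e. the most likely outliers.
k <- 2
outlier_rows <- head(order(scores), k)
```

On the toy data the scores are 9/3, 7/3, 8/3, 6/3 and 5/3, so with k = 2 the rows returned are 5 and 4.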
Upvotes: 6