xythum

Reputation: 53

Attribute Value Frequency in R (outliers in categorical variables)

I'm trying to implement the AVF (Attribute Value Frequency) algorithm for outlier detection in categorical data (A. Koufakou, E.G. Ortiz, et al., http://www.enriquegortiz.com/publications/outlierDetection_ictai07.pdf) in R. My data set has about 500,000 rows and 13 variables (all categorical).

The pseudocode is pretty simple:

# Label all data points as non-outliers
# calculate frequency of each attribute value
# foreach point x
#   AVFscore = Sum(frequency each attrib. value in x)/num.attribs
# end foreach
# return top k outliers with minimum AVFscore

To get the frequency of each attribute value, I used

freq_matrix <- apply(mydata, MARGIN = 2, FUN = count) # from plyr

which gave me a list of data frames, one per variable, each holding that variable's value frequencies. So far so good.

I'm stuck on how to iterate over each row and compute the 'AVFscore'. I'm fairly sure I need apply(), but I can't work out how it should go. For each row, I need to look up the frequency of each value in freq_matrix, sum them, and divide by the number of variables (i.e. 13).

Example:

Country_cde    Flag1     Flag2      Score
IE               A         X         9/13
IE               B         X         7/13
US               A         X         8/13
US               A         Y         6/13
IE               C         Z         5/13

So for Country_cde, IE's frequency is 3 and US's is 2. For Flag1, A is 3, B is 1, C is 1, and so on. In this example the final row has the lowest score, so it would be the most likely outlier.

Upvotes: 5

Views: 3172

Answers (1)

Tensibai

Reputation: 15784

Base R approach:

mydata <- read.table(text="Country_cde    Flag1     Flag2
IE               A         X
IE               B         X
US               A         X
US               A         Y
IE               C         Z",header=T,stringsAsFactors=F)

freq_matrix <- table( unlist( unname(mydata) ) ) # Count occurrences of every value, pooled across all columns (assumes no value appears in more than one column)

mydata[,"Score"] <- apply( mydata,1, function(x) { paste0( sum(freq_matrix[x]) ,"/", length(x) )}) # Sum the frequencies and paste with the number of columns (which could be computed once, outside the function)

Output:

> mydata
  Country_cde Flag1 Flag2 Score
1          IE     A     X   9/3
2          IE     B     X   7/3
3          US     A     X   8/3
4          US     A     Y   6/3
5          IE     C     Z   5/3
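
One caveat with pooling every column into a single table: if the same value can occur in more than one variable (say "A" in both Flag1 and Flag2), their counts get merged. A minimal sketch of a per-column lookup, assuming that is a concern for your data. `avf_score` and `toy` are hypothetical names used for illustration, not part of any package:

```r
# Hypothetical helper: AVF score with frequencies counted separately per column
avf_score <- function(df) {
  # For each column, count its values, then look up the count for every cell
  counts <- lapply(df, function(col) {
    col <- as.character(col)
    as.vector(table(col)[col])
  })
  # Add the per-column counts element-wise, then divide by the number of attributes
  Reduce(`+`, counts) / length(counts)
}

# Same toy data as above
toy <- data.frame(
  Country_cde = c("IE", "IE", "US", "US", "IE"),
  Flag1       = c("A", "B", "A", "A", "C"),
  Flag2       = c("X", "X", "X", "Y", "Z"),
  stringsAsFactors = FALSE
)
scores <- avf_score(toy)
```

This works column by column rather than row by row, which may also be quicker than a row-wise apply() on ~500,000 rows, though I haven't benchmarked it.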

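If you want the actual numeric value, drop paste0. If Score has already been added by the line above, restrict apply to the original attribute columns so that Score itself is not looked up:

attribs <- c("Country_cde", "Flag1", "Flag2") # the original attribute columns, excluding Score
mydata[,"Score"] <- apply(mydata[, attribs],1,function(x) { sum(freq_matrix[x]) / length(x) })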
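
The last step of the pseudocode, returning the k points with the lowest score, can then be done with order(). A minimal sketch, assuming a numeric Score column. The data frame `scored` is a hypothetical stand-in for the scored data, and k = 2 is an arbitrary choice for this toy example:

```r
# Toy data with the numeric scores from the example (sum of frequencies / 3)
scored <- data.frame(
  Country_cde = c("IE", "IE", "US", "US", "IE"),
  Flag1       = c("A", "B", "A", "A", "C"),
  Flag2       = c("X", "X", "X", "Y", "Z"),
  Score       = c(9, 7, 8, 6, 5) / 3,
  stringsAsFactors = FALSE
)

k <- 2  # number of outliers to report (arbitrary here)
# order() sorts ascending, so the first k rows have the lowest scores
outliers <- head(scored[order(scored$Score), ], k)
```

Here `outliers` holds rows 5 and 4 of the toy data, the two rows with the rarest attribute values.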

Upvotes: 6
