Brian Jackson

Reputation: 409

What is the fastest way to get the frequency of a factor in R?

I am trying to filter out low-frequency factor levels in a data set. The problem looks something like this:

require(digest)
require(ff)
require(ffbase)

test.vector.ffdf = as.ffdf(as.ff(as.factor(sample(sapply(1:1000, digest), 50000000, replace = TRUE))))

get.frequency=function(i,column){   
  freq = sum(test.vector.ffdf[,column] == i)/length(test.vector.ffdf[,column])
  print(paste0(i,' ',freq))
  freq
}

column = 1
sapply(unique(test.vector.ffdf[,column]),get.frequency, column = column)

As you can see, this takes a very long time, and I have a number of columns to process, each with thousands of factor levels. Is there any way to retrieve the frequencies much faster?

Clarification: in this example, the print() in the function is only there to show progress. The sapply would be used to get a list of frequencies that could then be acted on, i.e. [i where freq < 0.001].

Upvotes: 0

Views: 1141

Answers (2)

Brian Jackson

Reputation: 409

I tried a couple of different methods from "What is the fastest way to obtain frequencies of integers in a vector?", which of course deals with integers, not characters:

require(digest)
require(ff)
require(ffbase)

test.vector.ffdf = as.ffdf(as.ff(as.factor(sample(sapply(1:10, digest), 50000000, replace = TRUE))))

get.frequency=function(i,column){   
  freq = sum(test.vector.ffdf[,column] == i)/length(test.vector.ffdf[,column])
  #print(paste0(i,' ',freq))
  freq
}

column = 1

x = test.vector.ffdf[,column]

system.time(table(x))
#   user  system elapsed 
#  3.548   0.000   3.561 

system.time(sapply(unique(test.vector.ffdf[,column]),get.frequency, column = column))
#   user  system elapsed 
# 39.049   5.127  44.322 

system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})
#   user  system elapsed 
#217.060   2.851 220.865 

Edit: adding the binned_sum solution from the other answer so it can be compared on the same system:

test.vector.ffdf$one <- ff(1L, length = nrow(test.vector.ffdf))
system.time(binned_sum(x = test.vector.ffdf$one, bin = test.vector.ffdf$x))
#   user  system elapsed 
#  0.731   0.283   1.018

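So table is far faster than my original sapply approach, and it isn't affected by the number of factor levels the way my solution was. On the same system, binned_sum is faster still (about 1 s elapsed versus about 3.6 s for table).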
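
For the filtering step mentioned in the question (acting on levels below some frequency), here is a minimal sketch using base R's table() and prop.table() on a small in-memory factor. The data, the variable names, and the 0.05 threshold are illustrative assumptions, not part of the original benchmark, and on an ffdf the column would first have to be pulled into RAM as above.

```r
# Small in-memory stand-in for one factor column (illustrative data only).
x <- factor(c(rep("a", 60), rep("b", 35), rep("c", 4), "d"))

# table() counts every level in one pass; prop.table() turns counts into shares.
freq <- prop.table(table(x))

# Levels whose relative frequency is below a hypothetical threshold.
threshold <- 0.05
rare <- names(freq)[freq < threshold]

# Drop the rows belonging to rare levels, then drop the now-unused levels.
x.filtered <- droplevels(x[!(x %in% rare)])
```

With the 50 million row column, the threshold from the question (0.001) would replace the 0.05 used here.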

Upvotes: 0

user1600826

require(digest)
require(ff)
require(ffbase)

test.vector.ffdf = as.ffdf(as.ff(as.factor(sample(sapply(1:10, digest), 50000000, replace = TRUE))))
test.vector.ffdf$one <- ff(1L, length = nrow(test.vector.ffdf))
system.time(binned_sum(x = test.vector.ffdf$one, bin = test.vector.ffdf$x))
# user  system elapsed 
# 1.463   0.372   1.835 
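
The idea behind this, summing a column of ones within each bin, amounts to counting rows per factor level. A rough base R illustration of the same idea on a tiny in-memory factor follows. It is only a sketch of the concept, not how ffbase implements binned_sum, and the variable names are made up for the example.

```r
# Tiny in-memory factor standing in for the ff column (illustrative data only).
x <- factor(c("a", "b", "a", "c", "a", "b"))

# Summing a vector of ones within each level gives the count per level.
ones <- rep(1L, length(x))
counts.by.sum <- tapply(ones, x, sum)

# tabulate() gets the same counts directly from the integer level codes.
counts <- tabulate(as.integer(x), nbins = nlevels(x))
names(counts) <- levels(x)

# Relative frequencies, comparable to what get.frequency() returns per level.
rel.freq <- counts / length(x)
```

Dividing the counts by the number of rows gives the frequencies the question asks for, which can then be compared against a cutoff.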

Upvotes: 1
