Reputation: 139
I have a set of strings in an R variable; when I check its class, it says it is a factor. For example:
mySet<-c("abc","abc","def","abc","def","efg","abc")
I want to get the string that occurs the maximum number of times in this set (i.e. "abc" in this case).
I understand one approach is to use hist(), but I am running into data type issues, and since I'm new to R I wasn't able to crack this one by myself.
Upvotes: 6
Views: 17253
Reputation: 521
repeated <- function(x) as(names(which.max(table(x))), mode(x))
repeated(a)
where a is a vector of either words or numbers.
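As a usage sketch, the one-liner above can be called on character and numeric vectors as shown here. One caveat: mode() of a factor is "numeric", so the final coercion would try to turn the winning label into a number. `repeated_any` below is a hypothetical variant (my assumption, not part of the answer) that converts factors to character first.

```r
library(methods)  # as() lives here; attached by default in recent R versions

# The one-liner from the answer, unchanged
repeated <- function(x) as(names(which.max(table(x))), mode(x))

# Hypothetical factor-aware variant: an assumption, not from the original answer
repeated_any <- function(x) {
  if (is.factor(x)) x <- as.character(x)
  as(names(which.max(table(x))), mode(x))
}

words <- c("abc", "abc", "def", "abc", "def", "efg", "abc")
nums  <- c(3, 1, 3, 2, 3)

repeated(words)              # most frequent string
repeated(nums)               # most frequent number, returned as numeric
repeated_any(factor(words))  # factor input handled via as.character()
```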
Upvotes: 0
Reputation: 193497
Depending on the size of your data and how often you need to do such an exercise, you might want to spend some time writing a more efficient function. Underlying table() is tabulate(), which is much faster, and can thus lead to a function like the following:
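MaxTable <- function(InVec, mult = FALSE) {
  ## tabulate() needs integer codes, so convert non-factors first
  if (!is.factor(InVec)) InVec <- factor(InVec)
  A <- tabulate(InVec)
  if (isTRUE(mult)) {
    ## all levels tied for the highest count
    levels(InVec)[A == max(A)]
  } else {
    ## first level with the highest count
    levels(InVec)[which.max(A)]
  }
}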
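To make the idea concrete, here is a minimal sketch of what tabulate() returns for the sample data from the question, and how its counts line up with the factor levels. Passing nbins = nlevels(f) is my own addition (an assumption, not in the function above) so that the count vector always has one entry per level.

```r
# Sample data from the question, stored as a factor
f <- factor(c("abc", "abc", "def", "abc", "def", "efg", "abc"))

# tabulate() counts the integer codes behind the factor;
# nbins = nlevels(f) is an assumption added here, not part of MaxTable()
counts <- tabulate(f, nbins = nlevels(f))

levels(f)                         # "abc" "def" "efg"
counts                            # 4 2 1
levels(f)[which.max(counts)]      # "abc"
levels(f)[counts == max(counts)]  # every level tied for the maximum
```

Because the counts are positional, the result has to be read from levels(f), not from the original vector.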
This function can also report ties: with mult = TRUE it returns every value that shares the maximum count. Compare the following:
mySet <- c("A", "A", "A", "B", "B", "B", "C", "C")
## Your question indicates that you have factors,
## but your sample code is a character vector
mySetF <- factor(mySet) ## Just as an example
## @BrodieG's answer
fun1 <- function(InVec) {
names(which.max(table(InVec)))
}
## @sgibb's answer
fun2 <- function(InVec) {
m <- which.max(table(as.character(InVec)))
as.character(InVec)[m]
}
fun1(mySet)
# [1] "A"
fun2(mySet)
# [1] "A"
MaxTable(mySet)
# [1] "A"
MaxTable(mySet, mult = TRUE)
# [1] "A" "B"
library(microbenchmark)
microbenchmark(fun1(mySet), fun2(mySet), MaxTable(mySet), MaxTable(mySetF))
# Unit: microseconds
#              expr     min       lq   median       uq      max neval
#       fun1(mySet) 291.457 297.1845 302.2080 313.1235 3008.108   100
#       fun2(mySet) 296.388 302.0775 311.3170 321.5260 1367.137   100
#   MaxTable(mySet) 172.463 180.8755 184.8355 189.9700 1947.700   100
#  MaxTable(mySetF)  34.510  38.1545  44.6045  46.6695   95.341   100
For small vectors, this function is more efficient, and the gap is even wider when the input is already a factor. How about with bigger vectors?
set.seed(1)
medSet <- sample(c(LETTERS, letters), 1e5, TRUE)
medSetF <- factor(medSet)
fun1(medSet)
# [1] "E"
fun2(medSet) ### Wrong Answer!!!
# [1] "D"
MaxTable(medSet)
# [1] "E"
microbenchmark(fun1(medSet), MaxTable(medSet), MaxTable(medSetF))
# Unit: microseconds
#               expr       min        lq     median        uq       max neval
#       fun1(medSet) 14222.846 14350.957 14484.4490 14600.490 34810.174   100
#   MaxTable(medSet)  7787.761  7860.248  7917.3455  8019.068  9762.884   100
#  MaxTable(medSetF)   501.733   529.257   570.0735   587.936  1469.994   100
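I've dropped @sgibb's function from the benchmarks (it runs in about the same time as fun1()) since it returns the wrong answer: which.max() gives a position within the table, but fun2() uses that position to index the original vector.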
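The reason fun2() can go wrong is easiest to see on a tiny input. The sketch below uses a made-up three-element vector (my own example, not from the benchmarks) to show that the position returned by which.max() refers to the table, not to the original data.

```r
x <- c("b", "a", "a")

tab <- table(x)      # counts are ordered by sorted value: a = 2, b = 1
m <- which.max(tab)  # position 1 within the table, named "a"

names(m)             # "a": the correct answer, read from the table
x[m]                 # "b": position 1 of the original vector, which is wrong
```

On the small mySet example the two happen to agree only because the most frequent value is also the first element.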
One last benchmark....
set.seed(3)
bigSet <- sample(c(LETTERS, letters), 1e7, TRUE)
bigSetF <- factor(bigSet)
microbenchmark(fun1(bigSet), MaxTable(bigSet), MaxTable(bigSetF), times = 10)
# Unit: milliseconds
#               expr        min         lq     median         uq        max neval
#       fun1(bigSet) 1519.37503 1612.10290 1648.36473 1789.02965 1932.41073    10
#   MaxTable(bigSet)  782.01856  791.86408  834.35764  894.60535 1019.28747    10
#  MaxTable(bigSetF)   48.56459   48.76492   49.25444   49.93911   50.20404    10
Upvotes: 14