Nick
Nick

Reputation: 22175

How to find the statistical mode?

In R, mean() and median() are standard functions which do what you'd expect. mode() tells you the internal storage mode of the object, not the value that occurs the most in its argument. But is there is a standard library function that implements the statistical mode for a vector (or list)?

Upvotes: 514

Views: 422981

Answers (30)

Gustavo
Gustavo

Reputation: 31

I came up with a tidyverse solution, using forecats. You can use that inside of summarize

values <- c("A", "A", "B", "C", "C", "C")

fct_infreq(values) |> levels() |> first()

This returns "C"

Upvotes: 0

Sebastian
Sebastian

Reputation: 1369

The generic function fmode() in the collapse package implements an optimized C-level hashing algorithm to computed the (weighted) mode. It is significantly faster than the above approaches, and also supports grouping and multithreading. It comes with methods for vectors, matrices, and (grouped) data.frame-like objects. Syntax:

library(collapse)
fmode(x, g = NULL, w = NULL, ..., ties = "first")

where x can be one of the above objects, g supplies an optional grouping vector or list of grouping vectors (for grouped mode calculations), and w (optionally) supplies a numeric weight vector. ties can be "first", "last", "min" or "max", e.g.

x <- c(1, 3, 2, 2, 4, 4, 1, 7, NA, NA, NA)
fmode(x)                # Default is ties = "first"
#> [1] 2
fmode(x, ties = "last")
#> [1] 1
fmode(x, ties = "min")
#> [1] 1
fmode(x, ties = "max")
#> [1] 4
fmode(x, na.rm = FALSE) # Here NA is the mode
#> [1] NA

As an example of the grouped data frame method, this computes a population weighted first mode of a heterogeneous development indicators dataset by income group (using 4 threads).

wlddev |> fgroup_by(income) |> fmode(POP, nthreads = 4)
#>                income      sum.POP       country iso3c       date year decade
#> 1         High income  58840837058 United States   USA 2020-01-01 2019   2010
#> 2          Low income  20949161394      Ethiopia   ETH 2020-01-01 2019   2010
#> 3 Lower middle income 113837684528         India   IND 2020-01-01 2019   2010
#> 4 Upper middle income 119606023798         China   CHN 2020-01-01 2019   2010
#>                  region  OECD      PCGDP   LIFEEX GINI        ODA
#> 1 Europe & Central Asia  TRUE 55753.1444 78.53902 40.0  -76339996
#> 2    Sub-Saharan Africa FALSE   602.6341 66.59700 35.0 4893290039
#> 3            South Asia FALSE  2151.7260 69.65600 35.7 2608629883
#> 4   East Asia & Pacific FALSE  8242.0546 76.91200 39.7 -559890015

Upvotes: 16

Gaspar
Gaspar

Reputation: 145

One quick way would be to use DescTools::Mode.

Upvotes: 2

Yorgos
Yorgos

Reputation: 30445

There is package modeest which provide estimators of the mode of univariate unimodal (and sometimes multimodal) data and values of the modes of usual probability distributions.

mySamples <- c(19, 4, 5, 7, 29, 19, 29, 13, 25, 19)

library(modeest)
mlv(mySamples, method = "mfv")

Mode (most likely value): 19 
Bickel's modal skewness: -0.1 
Call: mlv.default(x = mySamples, method = "mfv")

For more information see this page

You may also look for "mode estimation" in CRAN Task View: Probability Distributions. Two new packages have been proposed.

Upvotes: 76

GuillaumeL
GuillaumeL

Reputation: 1015

Here is my data.table solution that returns row-wise modes for a complete table. I use it to infer row class. It takes care of the new-ish set() function in data.table and should be pretty fast. It does not manage NA though but that could be added by looking at the numerous other solutions on this page.

majorityVote <- function(mat_classes) {
  #mat_classes = dt.pour.centroids_num
  dt.modes <- data.table(mode = integer(nrow(mat_classes)))
  for (i in 1:nrow(mat_classes)) {
    cur.row <- mat_classes[i]
    cur.mode <- which.max(table(t(cur.row)))
    set(dt.modes, i=i, j="mode", value = cur.mode)
  }

  return(dt.modes)
}

Possible usage:

newClass <- majorityVote(my.dt)  # just a new vector with all the modes

Upvotes: 0

Fredy Yoseph Marianto
Fredy Yoseph Marianto

Reputation: 31

If you ask the built-in function in R, maybe you can find it on package pracma. Inside of that package, there is a function called Mode.

Upvotes: 2

obrl_soil
obrl_soil

Reputation: 1214

Adding in raster::modal() as an option, although note that raster is a hefty package and may not be worth installing if you don't do geospatial work.

The source code could be pulled out of https://github.com/rspatial/raster/blob/master/src/modal.cpp and https://github.com/rspatial/raster/blob/master/R/modal.R into a personal R package, for those who are particularly keen.

Upvotes: 0

Ana Nimbus
Ana Nimbus

Reputation: 735

It seems to me that if a collection has a mode, then its elements can be mapped one-to-one with the natural numbers. So, the problem of finding the mode reduces to producing such a mapping, finding the mode of the mapped values, then mapping back to some of the items in the collection. (Dealing with NA occurs at the mapping phase).

I have a histogram function that operates on a similar principal. (The special functions and operators used in the code presented herein should be defined in Shapiro and/or the neatOveRse. The portions of Shapiro and neatOveRse duplicated herein are so duplicated with permission; the duplicated snippets may be used under the terms of this site.) R pseudocode for histogram is

.histogram <- function (i)
        if (i %|% is.empty) integer() else
        vapply2(i %|% max %|% seqN, `==` %<=% i %O% sum)

histogram <- function(i) i %|% rmna %|% .histogram

(The special binary operators accomplish piping, currying, and composition) I also have a maxloc function, which is similar to which.max, but returns all the absolute maxima of a vector. R pseudocode for maxloc is

FUNloc <- function (FUN, x, na.rm=F)
        which(x == list(identity, rmna)[[na.rm %|% index.b]](x) %|% FUN)

maxloc <- FUNloc %<=% max

minloc <- FUNloc %<=% min # I'M THROWING IN minloc TO EXPLAIN WHY I MADE FUNloc

Then

imode <- histogram %O% maxloc

and

x %|% map %|% imode %|% unmap

will compute the mode of any collection, provided appropriate map-ping and unmap-ping functions are defined.

Upvotes: -1

GKi
GKi

Reputation: 39647

I case your observations are classes from Real numbers and you expect that the mode to be 2.5 when your observations are 2, 2, 3, and 3 then you could estimate the mode with mode = l1 + i * (f1-f0) / (2f1 - f0 - f2) where l1..lower limit of most frequent class, f1..frequency of most frequent class, f0..frequency of classes before most frequent class, f2..frequency of classes after most frequent class and i..Class interval as given e.g. in 1, 2, 3:

#Small Example
x <- c(2,2,3,3) #Observations
i <- 1          #Class interval

z <- hist(x, breaks = seq(min(x)-1.5*i, max(x)+1.5*i, i), plot=F) #Calculate frequency of classes
mf <- which.max(z$counts)   #index of most frequent class
zc <- z$counts
z$breaks[mf] + i * (zc[mf] - zc[mf-1]) / (2*zc[mf] - zc[mf-1] - zc[mf+1])  #gives you the mode of 2.5


#Larger Example
set.seed(0)
i <- 5          #Class interval
x <- round(rnorm(100,mean=100,sd=10)/i)*i #Observations

z <- hist(x, breaks = seq(min(x)-1.5*i, max(x)+1.5*i, i), plot=F)
mf <- which.max(z$counts)
zc <- z$counts
z$breaks[mf] + i * (zc[mf] - zc[mf-1]) / (2*zc[mf] - zc[mf-1] - zc[mf+1])  #gives you the mode of 99.5

In case you want the most frequent level and you have more than one most frequent level you can get all of them e.g. with:

x <- c(2,2,3,5,5)
names(which(max(table(x))==table(x)))
#"2" "5"

Upvotes: 1

Dan Houghton
Dan Houghton

Reputation: 121

This builds on jprockbelly's answer, by adding a speed up for very short vectors. This is useful when applying mode to a data.frame or datatable with lots of small groups:

Mode <- function(x) {
   if ( length(x) <= 2 ) return(x[1])
   if ( anyNA(x) ) x = x[!is.na(x)]
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
}

Upvotes: 5

Jibin
Jibin

Reputation: 333

Mode can't be useful in every situations. So the function should address this situation. Try the following function.

Mode <- function(v) {
  # checking unique numbers in the input
  uniqv <- unique(v)
  # frquency of most occured value in the input data
  m1 <- max(tabulate(match(v, uniqv)))
  n <- length(tabulate(match(v, uniqv)))
  # if all elements are same
  same_val_check <- all(diff(v) == 0)
  if(same_val_check == F){
    # frquency of second most occured value in the input data
    m2 <- sort(tabulate(match(v, uniqv)),partial=n-1)[n-1]
    if (m1 != m2) {
      # Returning the most repeated value
      mode <- uniqv[which.max(tabulate(match(v, uniqv)))]
    } else{
      mode <- "Two or more values have same frequency. So mode can't be calculated."
    }
  } else {
    # if all elements are same
    mode <- unique(v)
  }
  return(mode)
}

Output,

x1 <- c(1,2,3,3,3,4,5)
Mode(x1)
# [1] 3

x2 <- c(1,2,3,4,5)
Mode(x2)
# [1] "Two or more varibles have same frequency. So mode can't be calculated."

x3 <- c(1,1,2,3,3,4,5)
Mode(x3)
# [1] "Two or more values have same frequency. So mode can't be calculated."

Upvotes: 2

Ken Williams
Ken Williams

Reputation: 23955

One more solution, which works for both numeric & character/factor data:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

On my dinky little machine, that can generate & find the mode of a 10M-integer vector in about half a second.

If your data set might have multiple modes, the above solution takes the same approach as which.max, and returns the first-appearing value of the set of modes. To return all modes, use this variant (from @digEmAll in the comments):

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

Upvotes: 524

Abhiroop Sarkar
Abhiroop Sarkar

Reputation: 2311

There are multiple solutions provided for this one. I checked the first one and after that wrote my own. Posting it here if it helps anyone:

Mode <- function(x){
  y <- data.frame(table(x))
  y[y$Freq == max(y$Freq),1]
}

Lets test it with a few example. I am taking the iris data set. Lets test with numeric data

> Mode(iris$Sepal.Length)
[1] 5

which you can verify is correct.

Now the only non numeric field in the iris dataset(Species) does not have a mode. Let's test with our own example

> test <- c("red","red","green","blue","red")
> Mode(test)
[1] red

EDIT

As mentioned in the comments, user might want to preserve the input type. In which case the mode function can be modified to:

Mode <- function(x){
  y <- data.frame(table(x))
  z <- y[y$Freq == max(y$Freq),1]
  as(as.character(z),class(x))
}

The last line of the function simply coerces the final mode value to the type of the original input.

Upvotes: 3

C8H10N4O2
C8H10N4O2

Reputation: 18995

A small modification to Ken Williams' answer, adding optional params na.rm and return_multiple.

Unlike the answers relying on names(), this answer maintains the data type of x in the returned value(s).

stat_mode <- function(x, return_multiple = TRUE, na.rm = FALSE) {
  if(na.rm){
    x <- na.omit(x)
  }
  ux <- unique(x)
  freq <- tabulate(match(x, ux))
  mode_loc <- if(return_multiple) which(freq==max(freq)) else which.max(freq)
  return(ux[mode_loc])
}

To show it works with the optional params and maintains data type:

foo <- c(2L, 2L, 3L, 4L, 4L, 5L, NA, NA)
bar <- c('mouse','mouse','dog','cat','cat','bird',NA,NA)

str(stat_mode(foo)) # int [1:3] 2 4 NA
str(stat_mode(bar)) # chr [1:3] "mouse" "cat" NA
str(stat_mode(bar, na.rm=T)) # chr [1:2] "mouse" "cat"
str(stat_mode(bar, return_mult=F, na.rm=T)) # chr "mouse"

Thanks to @Frank for simplification.

Upvotes: 10

user6764048
user6764048

Reputation: 7

An easy way to calculate MODE of a vector 'v' containing discrete values is:

names(sort(table(v)))[length(sort(table(v)))]

Upvotes: -3

Gaurav
Gaurav

Reputation: 1109

Below is the code which can be use to find the mode of a vector variable in R.

a <- table([vector])

names(a[a==max(a)])

Upvotes: 3

Ashutosh Agrahari
Ashutosh Agrahari

Reputation: 31

Calculating Mode is mostly in case of factor variable then we can use

labels(table(HouseVotes84$V1)[as.numeric(labels(max(table(HouseVotes84$V1))))])

HouseVotes84 is dataset available in 'mlbench' package.

it will give max label value. it is easier to use by inbuilt functions itself without writing function.

Upvotes: 0

Nsquare
Nsquare

Reputation: 91

This hack should work fine. Gives you the value as well as the count of mode:

Mode <- function(x){
a = table(x) # x is a vector
return(a[which.max(a)])
}

Upvotes: 6

hugovdberg
hugovdberg

Reputation: 1631

Based on @Chris's function to calculate the mode or related metrics, however using Ken Williams's method to calculate frequencies. This one provides a fix for the case of no modes at all (all elements equally frequent), and some more readable method names.

Mode <- function(x, method = "one", na.rm = FALSE) {
  x <- unlist(x)
  if (na.rm) {
    x <- x[!is.na(x)]
  }

  # Get unique values
  ux <- unique(x)
  n <- length(ux)

  # Get frequencies of all unique values
  frequencies <- tabulate(match(x, ux))
  modes <- frequencies == max(frequencies)

  # Determine number of modes
  nmodes <- sum(modes)
  nmodes <- ifelse(nmodes==n, 0L, nmodes)

  if (method %in% c("one", "mode", "") | is.na(method)) {
    # Return NA if not exactly one mode, else return the mode
    if (nmodes != 1) {
      return(NA)
    } else {
      return(ux[which(modes)])
    }
  } else if (method %in% c("n", "nmodes")) {
    # Return the number of modes
    return(nmodes)
  } else if (method %in% c("all", "modes")) {
    # Return NA if no modes exist, else return all modes
    if (nmodes > 0) {
      return(ux[which(modes)])
    } else {
      return(NA)
    }
  }
  warning("Warning: method not recognised.  Valid methods are 'one'/'mode' [default], 'n'/'nmodes' and 'all'/'modes'")
}

Since it uses Ken's method to calculate frequencies the performance is also optimised, using AkselA's post I benchmarked some of the previous answers as to show how my function is close to Ken's in performance, with the conditionals for the various ouput options causing only minor overhead: Comparison of Mode functions

Upvotes: 11

AkselA
AkselA

Reputation: 8837

I was looking through all these options and started to wonder about their relative features and performances, so I did some tests. In case anyone else are curious about the same, I'm sharing my results here.

Not wanting to bother about all the functions posted here, I chose to focus on a sample based on a few criteria: the function should work on both character, factor, logical and numeric vectors, it should deal with NAs and other problematic values appropriately, and output should be 'sensible', i.e. no numerics as character or other such silliness.

I also added a function of my own, which is based on the same rle idea as chrispy's, except adapted for more general use:

library(magrittr)

Aksel <- function(x, freq=FALSE) {
    z <- 2
    if (freq) z <- 1:2
    run <- x %>% as.vector %>% sort %>% rle %>% unclass %>% data.frame
    colnames(run) <- c("freq", "value")
    run[which(run$freq==max(run$freq)), z] %>% as.vector   
}

set.seed(2)

F <- sample(c("yes", "no", "maybe", NA), 10, replace=TRUE) %>% factor
Aksel(F)

# [1] maybe yes  

C <- sample(c("Steve", "Jane", "Jonas", "Petra"), 20, replace=TRUE)
Aksel(C, freq=TRUE)

# freq value
#    7 Steve

I ended up running five functions, on two sets of test data, through microbenchmark. The function names refer to their respective authors:

enter image description here

Chris' function was set to method="modes" and na.rm=TRUE by default to make it more comparable, but other than that the functions were used as presented here by their authors.

In matter of speed alone Kens version wins handily, but it is also the only one of these that will only report one mode, no matter how many there really are. As is often the case, there's a trade-off between speed and versatility. In method="mode", Chris' version will return a value iff there is one mode, else NA. I think that's a nice touch. I also think it's interesting how some of the functions are affected by an increased number of unique values, while others aren't nearly as much. I haven't studied the code in detail to figure out why that is, apart from eliminating logical/numeric as a the cause.

Upvotes: 2

Naimish Agarwal
Naimish Agarwal

Reputation: 516

Another possible solution:

Mode <- function(x) {
    if (is.numeric(x)) {
        x_table <- table(x)
        return(as.numeric(names(x_table)[which.max(x_table)]))
    }
}

Usage:

set.seed(100)
v <- sample(x = 1:100, size = 1000000, replace = TRUE)
system.time(Mode(v))

Output:

   user  system elapsed 
   0.32    0.00    0.31 

Upvotes: 1

Ernest S Kirubakaran
Ernest S Kirubakaran

Reputation: 1564

Here is a function to find the mode:

mode <- function(x) {
  unique_val <- unique(x)
  counts <- vector()
  for (i in 1:length(unique_val)) {
    counts[i] <- length(which(x==unique_val[i]))
  }
  position <- c(which(counts==max(counts)))
  if (mean(counts)==max(counts)) 
    mode_x <- 'Mode does not exist'
  else 
    mode_x <- unique_val[position]
  return(mode_x)
}

Upvotes: 3

Chris
Chris

Reputation: 153

The following function comes in three forms:

method = "mode" [default]: calculates the mode for a unimodal vector, else returns an NA
method = "nmodes": calculates the number of modes in the vector
method = "modes": lists all the modes for a unimodal or polymodal vector

modeav <- function (x, method = "mode", na.rm = FALSE)
{
  x <- unlist(x)
  if (na.rm)
    x <- x[!is.na(x)]
  u <- unique(x)
  n <- length(u)
  #get frequencies of each of the unique values in the vector
  frequencies <- rep(0, n)
  for (i in seq_len(n)) {
    if (is.na(u[i])) {
      frequencies[i] <- sum(is.na(x))
    }
    else {
      frequencies[i] <- sum(x == u[i], na.rm = TRUE)
    }
  }
  #mode if a unimodal vector, else NA
  if (method == "mode" | is.na(method) | method == "")
  {return(ifelse(length(frequencies[frequencies==max(frequencies)])>1,NA,u[which.max(frequencies)]))}
  #number of modes
  if(method == "nmode" | method == "nmodes")
  {return(length(frequencies[frequencies==max(frequencies)]))}
  #list of all modes
  if (method == "modes" | method == "modevalues")
  {return(u[which(frequencies==max(frequencies), arr.ind = FALSE, useNames = FALSE)])}  
  #error trap the method
  warning("Warning: method not recognised.  Valid methods are 'mode' [default], 'nmodes' and 'modes'")
  return()
}

Upvotes: 14

jprockbelly
jprockbelly

Reputation: 1607

I found Ken Williams post above to be great, I added a few lines to account for NA values and made it a function for ease.

Mode <- function(x, na.rm = FALSE) {
  if(na.rm){
    x = x[!is.na(x)]
  }

  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}

Upvotes: 65

RandallShanePhD
RandallShanePhD

Reputation: 5774

While I like Ken Williams simple function, I would like to retrieve the multiple modes if they exist. With that in mind, I use the following function which returns a list of the modes if multiple or the single.

rmode <- function(x) {
  x <- sort(x)  
  u <- unique(x)
  y <- lapply(u, function(y) length(x[x==y]))
  u[which( unlist(y) == max(unlist(y)) )]
} 

Upvotes: 2

Wei
Wei

Reputation: 375

Could try the following function:

  1. transform numeric values into factor
  2. use summary() to gain the frequency table
  3. return mode the index whose frequency is the largest
  4. transform factor back to numeric even there are more than 1 mode, this function works well!
mode <- function(x){
  y <- as.factor(x)
  freq <- summary(y)
  mode <- names(freq)[freq[names(freq)] == max(freq)]
  as.numeric(mode)
}

Upvotes: 0

Yollanda Beetroot
Yollanda Beetroot

Reputation: 333

I would use the density() function to identify a smoothed maximum of a (possibly continuous) distribution :

function(x) density(x, 2)$x[density(x, 2)$y == max(density(x, 2)$y)]

where x is the data collection. Pay attention to the adjust paremeter of the density function which regulate the smoothing.

Upvotes: 2

statistic1979
statistic1979

Reputation: 45

This works pretty fine

> a<-c(1,1,2,2,3,3,4,4,5)
> names(table(a))[table(a)==max(table(a))]

Upvotes: 4

girl
girl

Reputation: 19

You could also calculate the number of times an instance has happened in your set and find the max number. e.g.

> temp <- table(as.vector(x))
> names (temp)[temp==max(temp)]
[1] "1"
> as.data.frame(table(x))
r5050 Freq
1     0   13
2     1   15
3     2    6
> 

Upvotes: -1

AleRuete
AleRuete

Reputation: 313

I can't vote yet but Rasmus Bååth's answer is what I was looking for. However, I would modify it a bit allowing to contrain the distribution for example fro values only between 0 and 1.

estimate_mode <- function(x,from=min(x), to=max(x)) {
  d <- density(x, from=from, to=to)
  d$x[which.max(d$y)]
}

We aware that you may not want to constrain at all your distribution, then set from=-"BIG NUMBER", to="BIG NUMBER"

Upvotes: 10

Related Questions