user11418708

Reputation: 902

Most frequent element per column

I have the following matrix:

set.seed(3690)

example = matrix(sample(1:10, 100, replace = TRUE), nrow = 10)

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    4    4    2    7    2    2    3    8    2     5
 [2,]    7    3    2    6    6    5    7    8    1     3
 [3,]    7    5    7    9    4    9    4    8    2     7
 [4,]    5    3    4    2    1    5    9   10    9     5
 [5,]    9   10    7    2    7    4    9    1    1     9
 [6,]    2    3    5    1    2    8    1    5    9     4
 [7,]    5    4   10    5    9   10    1    6    1    10
 [8,]    6    3    9    7    1    1    9    2    1     7
 [9,]    5    9    4    8    9    9    5   10    5     4
[10,]   10    1    4    7    3    2    3    5    4     5

How can I find the top 10 (or top 5) most frequently occurring elements per column in R?

This is how I have coded this in Stata:

tempvar freq
generate byte `freq'=1

sort serial t0400_0415_d1-t0345_0400_d7

collapse (count) `freq' serial,  by(t0400_0415_d1-t0345_0400_d7) 
list, sepby(`freq')

gsort -`freq' t0400_0415_d1-t0345_0400_d7
generate rank=_n
keep if rank<=20
drop `freq'

sort  t0400_0415_d1-t0345_0400_d7
tempfile top20 
save `"`top20'"'

sort rank t0400_0415_d1-t0345_0400_d7 
list rank t0400_0415_d1-t0345_0400_d7 

Note that t0400_0415_d1 - t0345_0400_d7 are variable names.

Upvotes: 2

Views: 335

Answers (3)

M--

Reputation: 28825

In base R, this returns the two most frequent values per column (change 2 to 5 or 10 as needed):

sapply(1:ncol(example), function(x) rev(tail(names(sort(table(example[,x]))), 2)))

If you also want the frequencies, drop the names() call:

sapply(1:ncol(example), function(x) rev(tail(sort(table(example[,x])), 2)))
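
The same idea can be wrapped in a small helper that takes k as a parameter. This is only a sketch: `top_k` and the matrix `m` are hypothetical names introduced here, not part of the answer above. It uses lapply() rather than sapply() so that a column with fewer than k distinct values still gets its own (shorter) result instead of changing the shape of the output.

```r
# Hypothetical helper (an assumption, not from the answer above):
# the k most frequent values of a vector, most frequent first,
# returned as a named vector of counts.
top_k <- function(x, k) {
  counts <- sort(table(x), decreasing = TRUE)
  counts <- counts[seq_len(min(k, length(counts)))]  # never index past the end
  setNames(as.vector(counts), names(counts))
}

# Small fixed matrix (filled column-wise) so the result can be checked by hand.
m <- matrix(c(1, 1, 1, 2, 2, 3,
              4, 4, 4, 5, 5, 6), ncol = 2)

# One result per column; names are the values, entries are the counts.
res <- lapply(seq_len(ncol(m)), function(j) top_k(m[, j], k = 2))
res
```

For the question's matrix, the call would be lapply(seq_len(ncol(example)), function(j) top_k(example[, j], k = 5)).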

Upvotes: 3

chan1142

Reputation: 643

Using base R:

set.seed(1)
example <- matrix(sample(101:110, 500, replace = TRUE), nrow = 50)
# changed 1:10 to 101:110; changed 100 to 500 and nrow = 10 to 50

mostFreqVals <- function(x,k) {
    tbl <- table(x)
    as.integer(names(tbl)[order(-tbl)][1:k])
}
apply(example, 2, mostFreqVals, k=3)  # change k to 5, 10 or whatever
# 1st column is c(108,107,104)

You can verify the above code manually.

# -- Verify the first column --
table(example[,1])
# 101 102 103 104 105 106 107 108 109 110 
#   3   4   5   6   5   4   7   8   4   4 
# Frequency order: 108, 107, 104, (103, 105), ...
# 103 and 105 are tied (5 occurrences each), so you need a tie-breaking rule
# if k cuts through a tie.
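
One way to make the tie-breaking explicit is to pass a second key to order(). The sketch below is an assumption about what a reasonable rule might look like, not the answer's method: `top_k_ties` and its `prefer_larger` argument are hypothetical. The vector `col1` is rebuilt from the frequency table shown above, so it does not depend on the random seed.

```r
# Hypothetical helper: top-k values by frequency, with an explicit rule
# for ties (smaller value first by default, larger first on request).
top_k_ties <- function(x, k, prefer_larger = FALSE) {
  tbl  <- table(x)
  vals <- as.integer(names(tbl))   # the distinct values
  cnt  <- as.vector(tbl)           # their counts
  tie_key <- if (prefer_larger) -vals else vals
  ord <- order(-cnt, tie_key)      # count descending, then the tie rule
  vals[ord][seq_len(min(k, length(vals)))]
}

# Rebuild a vector with the counts shown in the table above.
col1 <- rep(101:110, times = c(3, 4, 5, 6, 5, 4, 7, 8, 4, 4))

top_k_ties(col1, 5)                        # 108 107 104 103 105
top_k_ties(col1, 5, prefer_larger = TRUE)  # 108 107 104 105 103
```

With k = 4 the two rules return different sets (103 versus 105 in last place), which is why the rule needs to be chosen deliberately.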

Upvotes: 1

tmfmnk

Reputation: 39858

One tidyverse possibility could be:

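library(dplyr)  # %>%, count(), arrange(), group_by(), slice()
library(tidyr)  # gather()

example %>%
  data.frame() %>%
  gather(var, val) %>%       # long format: one row per (column, value)
  count(var, val) %>%        # frequency of each value within each column
  arrange(var, desc(n)) %>%
  group_by(var) %>%
  slice(1:5)                 # keep the first 5 rows per column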

   var     val     n
   <chr> <int> <int>
 1 X1       10     3
 2 X1        6     2
 3 X1        7     2
 4 X1        2     1
 5 X1        3     1
 6 X10       6     2
 7 X10      10     2
 8 X10       1     1
 9 X10       2     1
10 X10       4     1

With slice(), you can choose the top n (here it is top 5) most frequently occurring elements per column.

Or, if you want the top n most frequently occurring elements across all columns combined:

example %>%
 data.frame() %>%
 gather(var, val) %>%
 count(val) %>%
 arrange(desc(n)) %>%
 slice(1:5)

    val     n
  <int> <int>
1     5    15
2     2    13
3     4    11
4     7    11
5     8    11
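
For comparison, the pooled count can also be sketched without the tidyverse by flattening the matrix first. The matrix `m` and the cutoff of 2 below are made-up illustration values, not the question's data.

```r
# Small fixed matrix (filled column-wise) used only for illustration.
m <- matrix(c(1, 1, 1, 2, 2, 3,
              1, 1, 4, 4, 4, 5), ncol = 2)

# Flatten the matrix, count every value once over all columns,
# then keep the 2 most frequent.
pooled <- sort(table(as.vector(m)), decreasing = TRUE)
top <- pooled[seq_len(min(2, length(pooled)))]
top
```

Here the value 1 occurs five times and 4 occurs three times, so those two are returned in that order.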

Upvotes: 2
