Fiona
Fiona

Reputation: 477

Select the n most frequent values in a variable

I would like to find the most common values in a column in a data frame. I assume using table would be the best way to do this? I then want to filter/subset my data frame to only include these top-n values.

An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.

ID    col
A     blue
A     purple
A     green
B     green
B     red
C     red
C     blue
C     yellow
C     orange

I therefore want to output the following:

Top 2 values of ID are:
A and C

I will then select the rows corresponding to ID A and C:

ID    col
A     blue
A     purple
A     green
C     red
C     blue
C     yellow
C     orange

Upvotes: 3

Views: 7056

Answers (3)

mdag02
mdag02

Reputation: 1165

With the tidyverse and its top_n :

library(tidyverse)
d %>%
  group_by(ID) %>%
  summarise(n()) %>%
  top_n(2)

Selecting by n()
# A tibble: 2 x 2
ID    `n()`
<fct> <int>
1 A         3
2 C         4

To complete with the subset :

d %>%
  group_by(ID) %>%
  summarise(n()) %>%
  top_n(2) %>% 
  { filter(d, ID %in% .$ID) }

Selecting by n()
ID    col
1  A   blue
2  A purple
3  A  green
4  C    red
5  C   blue
6  C yellow
7  C orange

(we use the braces because we don't feed the left hand side result as the first argument of the filter)

Upvotes: 1

Roman
Roman

Reputation: 17648

You can try a tidyverse. Add the counts of ID's, then filter for the top two (using < 3) or top ten (using < 11):

library(tidyverse)
d %>% 
  add_count(ID) %>% 
  filter(dense_rank(-n) < 3)
# A tibble: 7 x 3
  ID    col        n
  <fct> <fct>  <int>
1 A     blue       3
2 A     purple     3
3 A     green      3
4 C     red        4
5 C     blue       4
6 C     yellow     4
7 C     orange     4

Data

d <- read.table(text="ID    col
A     blue
                A     purple
                A     green
                B     green
                B     red
                C     red
                C     blue
                C     yellow
                C     orange", header=T)

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 388817

We can count the number of values using table, sort them in decreasing order and select first 2 (or 10) values, get the corresponding ID's and subset those ID's from the data frame.

df[df$ID %in% names(sort(table(df$ID), decreasing = TRUE)[1:2]), ]

#  ID    col
#1  A   blue
#2  A purple
#3  A  green
#6  C    red
#7  C   blue
#8  C yellow
#9  C orange

Upvotes: 2

Related Questions