How to select the top n values from a data frame while retaining duplicates in R

Let's say my sample data looks like this:

   id freq
    1    4
    2    3
    3    2
    4    2
    5    1

The freq column gives the frequency of each id. I want the rows with the top 3 frequencies, keeping ties, so the output should be:

   id freq
    1    4
    2    3
    3    2
    4    2

I used the following code.

d$rank <- rank(-d$freq,ties.method="min")

where d is my data frame. I used rank so that I can later select the top 3 frequencies. The output I got is:

id freq rank
 1    4    1
 2    3    2
 3    2    3
 4    2    3
 5    1    5

The problem is that rank 4 is missing: the two tied rows share rank 3 and the next row jumps to rank 5. I want continuous ranks, because my original data frame has many duplicated values. Any help is appreciated.

Thanks.

Upvotes: 0

Views: 1931

Answers (2)

akrun

Reputation: 886998

Assuming 'freq' is sorted in descending order, we get the unique elements of 'freq', select the first 3 with head, use %in% to get a logical index of the rows whose 'freq' is among those elements, and subset the rows.

subset(df1, freq %in% head(unique(freq),3))
#  id freq
#1  1    4
#2  2    3
#3  3    2
#4  4    2

If we want to use ranks, then dense_rank from dplyr is an option:

library(dplyr)
df1 %>%
    filter(dense_rank(-freq) < 4)

Another option is frank from data.table (contributed by @David Arenburg):

library(data.table)
setDT(df1)[, .SD[frank(-freq, ties.method = "dense") < 4]]
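
For reference, a dense (gap-free) rank can also be computed in base R without extra packages, which addresses the missing rank 4 directly. This is only a sketch, assuming the question's data frame d is recreated as below; the match/sort/unique combination is one possible way to do it, and top3 is an illustrative name:

```r
# Recreate the question's data frame
d <- data.frame(id = 1:5, freq = c(4, 3, 2, 2, 1))

# Dense rank: position of each freq among the sorted distinct values,
# so tied rows share a rank and the next rank is not skipped
d$rank <- match(d$freq, sort(unique(d$freq), decreasing = TRUE))

# Keep the rows belonging to the 3 highest distinct frequencies
top3 <- d[d$rank <= 3, ]
top3
```

Here the ranks come out as 1, 2, 3, 3, 4, so filtering on rank <= 3 keeps both tied rows.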

Upvotes: 1

talat

Reputation: 70256

Here's another base R approach:

df[cumsum(!duplicated(df$freq))<4,]
#  id freq
#1  1    4
#2  2    3
#3  3    2
#4  4    2

This assumes the data is already in descending order (as in the example).
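
If that assumption does not hold, sorting first should be enough. A small sketch, assuming a shuffled copy of the sample data (the construction of df and the names sorted and top are illustrative):

```r
# Sample data from the question in shuffled order
df <- data.frame(id = c(5, 3, 1, 4, 2), freq = c(1, 2, 4, 2, 3))

# Sort by descending freq so equal values sit next to each other
sorted <- df[order(-df$freq), ]

# cumsum(!duplicated(...)) counts distinct values seen so far,
# which acts as a dense rank on the sorted column
top <- sorted[cumsum(!duplicated(sorted$freq)) < 4, ]
top
```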

If you're willing to use external packages like dplyr, I'd suggest top_n:

library(dplyr)
top_n(df, 3, freq)
#  id freq
#1  1    4
#2  2    3
#3  3    2
#4  4    2
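
One thing worth checking on your own data: as far as I know, top_n keeps rows using a min-rank style comparison (ties share the lowest rank and later ranks are skipped), so it agrees with a dense rank only when the ties fall at the cutoff, as they do in this example. A base R sketch of the difference, using a made-up vector x and emulating the two rules with rank and match rather than calling dplyr itself:

```r
# Hypothetical frequencies with a tie ABOVE the cutoff
x <- c(4, 4, 3, 2, 1)

# Min rank: ties share the lowest rank, following ranks are skipped
min_rank_x <- rank(-x, ties.method = "min")

# Dense rank: ties share a rank, no ranks are skipped
dense_rank_x <- match(x, sort(unique(x), decreasing = TRUE))

kept_min <- x[min_rank_x <= 3]      # values kept under the min-rank rule
kept_dense <- x[dense_rank_x <= 3]  # values kept under the dense-rank rule
```

With this input the min-rank rule keeps 4, 4, 3 while the dense-rank rule keeps 4, 4, 3, 2, so the two only coincide on data shaped like the question's example.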

Upvotes: 2
