Reputation: 1575
I have a large data frame. I want to find the row indices of the n
lowest elements of some column.
For example, consider the following data frame df:
col_1 col_2 col_3
1 2 3
-1 2 21
2 3 1
So func(dataframe = df, column_name = col_1, n = 2)
should return
[1,2] # indices of the rows
Note: I want to avoid sorting the column.
Upvotes: 1
Views: 942
Reputation: 60472
An interesting question. I can think of (at least) four methods, all using base R. For simplicity I'm working with a vector instead of a data frame. If a method works on a vector, you can apply it to a data frame column and subset the data frame with the result.
First some dummy data
x = runif(1e6)
Now the four methods (in order of speed)
## Using partial sorting
f = function(n){
  cut_off = sort(x, partial=n+1)[n+1]
  x[x < cut_off]
}
## Using a faster method of sorting; but doesn't work with partial
g = function(n){
  cut_off = sort(x, method="radix")[n+1]
  x[x < cut_off]
}
## Ordering
h = function(n) x[order(x)[1:n]]
## Ranking
i = function(n) x[rank(x) %in% 1:n]
The timings suggest that partial sorting is the fastest of the four.
R> microbenchmark::microbenchmark(f(n), g(n), h(n), i(n), times = 4)
Unit: milliseconds
expr min lq mean median uq max neval cld
f(n) 112.8 116.0 122.1 122.6 128.1 130.2 4 a
g(n) 372.6 379.1 442.6 386.1 506.1 625.6 4 b
h(n) 1162.3 1196.0 1222.0 1238.4 1248.0 1248.8 4 c
i(n) 1414.9 1437.9 1489.1 1484.4 1540.3 1572.6 4 d
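The benchmark above does not show which value of n was used. As a rough sanity check, the sketch below compares the partial-sorting and ordering approaches on a smaller vector. The seed, the vector size, the value of n, and the function names partial_lowest and order_lowest are all assumptions made for illustration.

```r
set.seed(42)      # assumed seed, only for reproducibility
x <- runif(1e4)   # smaller than the 1e6 used above, to keep the check quick
n <- 10           # assumed value; the answer does not say which n was timed

# Same logic as f() above: partial sort up to position n + 1
partial_lowest <- function(n) {
  cut_off <- sort(x, partial = n + 1)[n + 1]
  x[x < cut_off]
}

# Same logic as h() above: full ordering
order_lowest <- function(n) x[order(x)[1:n]]

# Partial sorting keeps the values in their original order, while ordering
# returns them sorted, so sort the first result before comparing.
all.equal(sort(partial_lowest(n)), order_lowest(n))
```

Both approaches should select the same n values as long as there are no ties at the cut-off.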
To work with data frames you would have something like:
cut_off = sort(df$col, partial=n+1)[n+1]
df[df$col < cut_off,]
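Since the question asks for row indices rather than the rows themselves, the same idea can be wrapped in a small function that uses which(). This is only a sketch: the name n_lowest_rows is hypothetical, and unlike the snippet above it takes the n-th smallest value as the cut-off and compares with <=, so ties at the cut-off can return more than n indices.

```r
# Hypothetical helper: row indices of the n smallest values in a column.
n_lowest_rows <- function(df, column_name, n) {
  values <- df[[column_name]]
  stopifnot(n >= 1, n <= length(values))
  # Partial sort: only position n is guaranteed to hold the n-th smallest value
  cut_off <- sort(values, partial = n)[n]
  # which() converts the logical mask into row positions
  which(values <= cut_off)
}

# The data frame from the question
df <- data.frame(col_1 = c(1, -1, 2), col_2 = c(2, 2, 3), col_3 = c(3, 21, 1))
n_lowest_rows(df, "col_1", 2)
```

For the example data this gives rows 1 and 2, in row order rather than in order of value.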
Upvotes: 1
Reputation: 347
Using dplyr and (for easier code) magrittr:
data(iris) # use iris dataset
library(dplyr); library(magrittr) # load packages
iris %>%
filter(Sepal.Length %in% sort(Sepal.Length)[1:3])
This outputs the rows with the lowest 3 Sepal.Length values without sorting the data frame. In this case there are ties, so it outputs four rows.
To get the corresponding row names, you can use something like this:
rownames(subset(iris,
Sepal.Length %in% sort(Sepal.Length)[1:3]))
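If numeric row positions are wanted instead of row names, a base R variant of the same filter might look like the sketch below. The variable names lowest_values and row_idx are made up for illustration, and the column is still sorted in full here, just as in the snippet above.

```r
data(iris)  # built-in dataset, as in the answer above
n <- 3

# The n smallest values of the column (this sorts the whole column)
lowest_values <- sort(iris$Sepal.Length)[1:n]

# which() turns the logical mask into row positions
row_idx <- which(iris$Sepal.Length %in% lowest_values)

row_idx
length(row_idx)  # ties can make this larger than n
```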
Upvotes: 0
Reputation: 10506
This uses ordering, but here is one approach.
set.seed(1)
nr = 100
nc = 10
n = 5
ixCol = 1
input = matrix(runif(nr*nc),nrow = nr,ncol=nc)
input[head(order(input[,ixCol]),n),]
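The line above returns the rows themselves. To get the row indices the question asks for, the order() result can be kept on its own, as in this small sketch on the question's example data (the variable name idx is an assumption).

```r
# The data frame from the question
df <- data.frame(col_1 = c(1, -1, 2), col_2 = c(2, 2, 3), col_3 = c(3, 21, 1))
n <- 2

# order() gives row positions sorted by value; keep the first n
idx <- head(order(df$col_1), n)
idx  # 2 1: row 2 holds -1, row 1 holds 1

# The same positions can then be used to pull out the rows
df[idx, ]
```

Note that the indices come back in order of increasing value, not in row order.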
Upvotes: 0