Ranjan Pandey
Ranjan Pandey

Reputation: 85

How do I use Group By with order function in R

I have a dataframe with 120000 records and 19 variables, 2 of which are state and MonthlyIncome.

I have to create a new dataset with top 10(in terms of monthly income) customers from each state.

I tried a lot of options including data.table and dplyr and base but there is always something missing.

data.table :

 x <- customer_any_360[,order(-dense_rank(MonthlyIncome))[1:10], by = state]

--- example I tried

Could someone please help, am still new to R and really struggling with this problem. Thanks in advance!!

Upvotes: 4

Views: 7239

Answers (3)

Ken Benoit
Ken Benoit

Reputation: 14902

You're on the right track by seeking the answer in the amazing data.table package. Here I just made up some data for states 1 through 50, and drew incomes from a normal distribution N(50000, 20000^2).

with edits based on comments from @Arun and request for all columns from OP (using the .SD hidden variable):

require(data.table)
set.seed(123)
mydata <- data.table(state = 1:50, 
                     monthlyIncome = round(rnorm(120000, 50000, 20000)),
                     anothervar = 1:120000)
selecteddata <- mydata[order(-monthlyIncome), head(.SD, 10), by = state]

# just to verify
selecteddata <- selecteddata[order(state, -monthlyIncome)]
mydata <- mydata[order(-monthlyIncome)]
identical(selecteddata[1:10], mydata[state==1][1:10])  # state 1
identical(selecteddata[11:20], mydata[state==2][1:10]) # state 2

Upvotes: 2

akrun
akrun

Reputation: 887118

If you want to use rank functions, one option is frank from data.table where you can specify the option in ties.method.

library(data.table)#v1.9.5+
setDT(customer_any_360)[, .SD[frank(-MonthlyIncome, 
               ties.method='dense') %in% 1:10], by = state]

Or even just order is sufficient

setDT(customer_any_360)[order(-MonthlyIncome), .SD[1:10], by = state]

Using dplyr, there are mutliple options, dense_rank, min_rank, top_n depending on what you wanted. Also, for filtering, slice or filter can be used.

library(dplyr)
customer_any_360 %>%
           group_by(state) %>%
           slice(dense_rank(-MonthlyIncome)[1:10])

Or using sqldf

 library(sqldf)
 sqldf('select * from customer_any_360 i
          where rowid in 
          (select rowid from customer_any_360 
           where state = i.state 
           order by MonthlyIncome desc 
           limit 10)
  order by i.state, i.MonthlyIncome desc')

Or using ave from base R

indx <- with(customer_any_360, ave(-MonthlyIncome, state,
       FUN=function(x) rank(x, ties.method='first')) %in% 1:10)
customer_any_360[indx,]

EDIT: The frank option was edited with suggestions from @Arun

data

set.seed(24)
customer_any_360 <- data.frame(cust=1:120000, state= sample(state.abb,
  120000, replace=TRUE), MonthlyIncome= sample(1000:6500, 120000, 
     replace=TRUE), stringsAsFactors=FALSE)

Upvotes: 8

Joswin K J
Joswin K J

Reputation: 710

Use ddply from plyr package:

data(iris)
ddply(iris, "Species", function(x) head(x[order(x$Sepal.Length, decreasing = TRUE) , ], 2))
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.8         4.0          1.2         0.2     setosa
2          5.7         4.4          1.5         0.4     setosa
3          7.0         3.2          4.7         1.4 versicolor
4          6.9         3.1          4.9         1.5 versicolor
5          7.9         3.8          6.4         2.0  virginica
6          7.7         3.8          6.7         2.2  virginica

Upvotes: 3

Related Questions