Reputation: 85
I have a dataframe with 120000 records and 19 variables, 2 of which are state and MonthlyIncome.
I have to create a new dataset with top 10(in terms of monthly income) customers from each state.
I tried a lot of options including data.table and dplyr and base but there is always something missing.
data.table :
x <- customer_any_360[,order(-dense_rank(MonthlyIncome))[1:10], by = state]
--- example I tried
Could someone please help, am still new to R and really struggling with this problem. Thanks in advance!!
Upvotes: 4
Views: 7239
Reputation: 14902
You're on the right track by seeking the answer in the amazing data.table
package. Here I just made up some data for states 1 through 50, and drew incomes from a normal distribution N(50000, 20000^2).
with edits based on comments from @Arun and request for all columns from OP (using the .SD hidden variable):
require(data.table)
set.seed(123)
mydata <- data.table(state = 1:50,
monthlyIncome = round(rnorm(120000, 50000, 20000)),
anothervar = 1:120000)
selecteddata <- mydata[order(-monthlyIncome), head(.SD, 10), by = state]
# just to verify
selecteddata <- selecteddata[order(state, -monthlyIncome)]
mydata <- mydata[order(-monthlyIncome)]
identical(selecteddata[1:10], mydata[state==1][1:10]) # state 1
identical(selecteddata[11:20], mydata[state==2][1:10]) # state 2
Upvotes: 2
Reputation: 887118
If you want to use rank
functions, one option is frank
from data.table
where you can specify the option in ties.method
.
library(data.table)#v1.9.5+
setDT(customer_any_360)[, .SD[frank(-MonthlyIncome,
ties.method='dense') %in% 1:10], by = state]
Or even just order
is sufficient
setDT(customer_any_360)[order(-MonthlyIncome), .SD[1:10], by = state]
Using dplyr
, there are mutliple options, dense_rank
, min_rank
, top_n
depending on what you wanted. Also, for filtering, slice
or filter
can be used.
library(dplyr)
customer_any_360 %>%
group_by(state) %>%
slice(dense_rank(-MonthlyIncome)[1:10])
Or using sqldf
library(sqldf)
sqldf('select * from customer_any_360 i
where rowid in
(select rowid from customer_any_360
where state = i.state
order by MonthlyIncome desc
limit 10)
order by i.state, i.MonthlyIncome desc')
Or using ave
from base R
indx <- with(customer_any_360, ave(-MonthlyIncome, state,
FUN=function(x) rank(x, ties.method='first')) %in% 1:10)
customer_any_360[indx,]
EDIT: The frank
option was edited with suggestions from @Arun
set.seed(24)
customer_any_360 <- data.frame(cust=1:120000, state= sample(state.abb,
120000, replace=TRUE), MonthlyIncome= sample(1000:6500, 120000,
replace=TRUE), stringsAsFactors=FALSE)
Upvotes: 8
Reputation: 710
Use ddply from plyr package:
data(iris)
ddply(iris, "Species", function(x) head(x[order(x$Sepal.Length, decreasing = TRUE) , ], 2))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.7 4.4 1.5 0.4 setosa
3 7.0 3.2 4.7 1.4 versicolor
4 6.9 3.1 4.9 1.5 versicolor
5 7.9 3.8 6.4 2.0 virginica
6 7.7 3.8 6.7 2.2 virginica
Upvotes: 3