Martin
Martin

Reputation: 40573

Summarize and Rank Data Frame

Using R, I need to build a report of the top 2 employees per departments with the most expenses and add an "Others" for the other employees of the department. For instances, I would need a report similar to this.

Dept.      EmployeeId     Expense
Marketing       12345         100
Marketing       12346          90
Marketing      Others         200
Sales           12347          50 <-- There's just one employee with expenses
Research        12348        2000
Research        12349         900
Research       Others       10000

In other words, I need to summarize the data with a focus on the top 2 employees with the most expenses. The sum of the expense column should be the total amount of the expenses of the company.

employeIds <- sample(1000:9999, 20)
depts <- sample(c('Sales', 'Marketing', 'Research'), 20, replace = TRUE)
expenses <- sample(1:1000, 20, replace = TRUE)

df <- data.frame(employeIds, depts, expenses)

# Based on that data, how do I build a table with the top 2 employees with the most expenses in each department, including an "Other" employee per department.

I am new to R and I am not sure on how to approach this. In SQL I would have been able to use the RANK() function and JOIN but it isn't an option here.

Upvotes: 2

Views: 221

Answers (4)

Matt Dowle
Matt Dowle

Reputation: 59602

Another data.table approach (which might be closer to the SQL style you know) :

dt <- data.table(employeIds, depts, expenses)
dt[, rank:=rank(-expenses), by=depts][,
    list("Expenses"=sum(expenses)),
    keyby=list(depts, "Employee"=ifelse(rank<=2,employeIds,"Other"))
]
       depts Employee Expenses
1: Marketing     6988      986
2: Marketing     7011      940
3: Marketing    Other     2614
4:  Research     2434      763
5:  Research     9852      731
6:  Research    Other     3397
7:     Sales     3120      581
8:     Sales     6069      868

Upvotes: 2

Arun
Arun

Reputation: 118779

Here's a data.table solution:

Creating data: I've also made cases where "Others" won't happen (the number of entries for that dept is: 1 <= entries <= 2)

set.seed(45)
employeIds <- sample(1000:9999, 20)
depts <- sample(c('Sales', 'Marketing', 'Research'), 20, replace = TRUE)
expenses <- sample(1:1000, 20, replace = TRUE)

df <- data.frame(employeIds, depts, expenses)
df <- df[-c(6,10,12,18,19), ]

A data.table solution:

require(data.table)
dt <- data.table(df, key=c("depts", "expenses"))
k <- 2
dt[, if(.N > k) {
        idx <- (seq_len(.N)-1) %/% max(k, (.N - k)) == 1
        list(EmployeeIds = c(employeIds[idx], "Others"), 
           Expenses = c(expenses[idx], sum(expenses[!idx])))
     } else {
        list(EmployeeIds = as.character(employeIds), Expenses = expenses)
     }, by = depts]

#        depts EmployeeIds Expenses
# 1: Marketing        4870      567
# 2: Marketing        3167      591
# 3: Marketing      Others     2285
# 4:  Research        5989      878
# 5:  Research        9667      930
# 6:  Research      Others     1301
# 7:     Sales        6700      129
# 8:     Sales        3857      714

Idea: The first step of creating dt with key = depts, expenses ensures that expenses is sorted in increasing order. Then, depending on the number of entries per dept, we either create an "Others" entry or not.

Upvotes: 4

Rcoster
Rcoster

Reputation: 3210

May not be the most elegant, but it's a solution:

func <- function(data) {
 data1 <- aggregate(data$expenses, list(employeIds=data$employeIds), sum)
 # rank without ties.method = "first" will screw things up with identical values
 data1$employeIds[!(rank(data1$x, ties.method="first") %in% 1:2)] <- 'Others'
 data1 <- aggregate(data.frame(expenses=data1$x), list(employeIds=data1$employeIds), sum)
}

do.call(rbind, by(df, df$depts, func))

Upvotes: 2

user1600826
user1600826

Reputation:

df <- split(df, df$depts)
df <- lapply(df, FUN=function(x){
  x <- x[order(x$expenses, decreasing=TRUE), ]
  x$total.expenses <- sum(x$expenses)
  x$group <- 1:nrow(x)
  x$group <- ifelse(x$group <= 2, x$group, "Other")
  x
})
df <- do.call(rbind, df)

Upvotes: 1

Related Questions