Hindol Ganguly
Hindol Ganguly

Reputation: 363

Find mean for sorted top n transactions

Let my dataframe has 2 columns, customer id & transaction amount. Now for each unique customer id i would like to find the transaction amount(sorted in descending order) & then from the sorted column i will find the mean transaction amount for top three transactions of the sorted list.

Cust_id     trans_amount
12345          100      
12345          200      
12345          170      
12345          300      
12345          250
12456          140        
12456          240       
12456          160       
12456          100          

The format i am looking for is,

Cust_id     trans_amount
12345          300               
12345          250      
12345          200      
12345          170      
12345          100
12456          240        
12456          160       
12456          140       
12456          100          

and from there the mean for top 3, ie,

Cust_id    mean_for_top_3
12345         250
12456         180

for the intermediate part,I have tried,

ddply(cust_data,.(cust_id.),summarize,sorted_amount=sort(trans_amount,,decreasing=TRUE))

but not getting the result. Please advise how can i reach to my desired output.

Upvotes: 0

Views: 59

Answers (2)

amrrs
amrrs

Reputation: 6325

An idiomatic solution using dplyr

df <- read.table(text = "Cust_id     trans_amount
12345          100      
                 12345          200      
                 12345          170      
                 12345          300      
                 12345          250
                 12456          140        
                 12456          240       
                 12456          160       
                 12456          100    ", header  = T)


library(dplyr)

df %>% group_by(Cust_id) %>% 
  arrange(desc(trans_amount), .by_group = T) %>%
  top_n(n = 3) %>%
  summarize(mean = mean(trans_amount))

# A tibble: 2 x 2
  Cust_id  mean
    <int> <dbl>
1   12345   250
2   12456   180

Alternative with count:

  > df %>% group_by(Cust_id) %>% 
+   #arrange(desc(trans_amount), .by_group = T) %>% 
+   mutate(count = n()) %>%
+   top_n(n = 3, wt = trans_amount) %>%
+   mutate(mean = mean(trans_amount)) %>%
+   select(Cust_id,mean,count) %>% distinct()
# A tibble: 2 x 3
# Groups:   Cust_id [2]
  Cust_id  mean count
    <int> <dbl> <int>
1   12345   250     5
2   12456   180     4
> 

Upvotes: 1

pogibas
pogibas

Reputation: 28339

Solution using data.table:

library(data.table)
setDT(cust_data)
cust_data_sort <- cust_data[, .(trans_amount = sort(trans_amount, decreasing = TRUE)), Cust_id]
cust_data_sort[, .(mean_for_top_3 = mean(head(trans_amount, 3))), Cust_id]
   Cust_id mean_for_top_3
1:   12345            250
2:   12456            180

If you don't need sorted table cust_data_sort, then you can do get mean using this:

cust_data[, .(mean_for_top_3 = mean(head(sort(trans_amount, decreasing = TRUE), 3))), Cust_id]

Upvotes: 3

Related Questions