Yellow_truffle
Yellow_truffle

Reputation: 923

Using dplyr and group_by to calculate number of repetition for a value

I have a dataset which includes seller_ID, product_ID and year the product was sold, and I am trying to find the year that one seller had maximum sold product and the specific number of sold in that year for each individual seller. Here is an example of data

seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
                1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)

data<- data.frame(seller_ID,Product_ID,year)

       seller_ID Product_ID year
1          1       1000 2015
2          1       1000 2016
3          1       1005 2015
4          2       1004 2020
5          2       1005 2020
6          3       1003 2000
7          4       1010 2000
8          4       1000 2001
9          4       1001 2001
10         4       1019 2001
11         4       1017 2005

so the ideal result would be:

    seller_ID   Max_sold_num_year     Max_year
1         1                 2           2015
2         2                 2           2020
3         3                 1           2000
4         4                 3           2001

I tried the approach I explained below and it worked ...

df_temp<- data %>% 
  group_by(seller_ID, year) %>% 
  summarize(Sold_in_Year=length(Product_ID))


unique_seller=unique(data$seller_ID)


ID_list=c()
Max_list=c()
Max_Sold_Year=c()
j=1

for (ID in unique_seller) {
  df_temp_2<- subset(df_temp, df_temp$seller_ID==ID)
  Max_year<- subset(df_temp_2,df_temp_2$Sold_in_Year==max(df_temp_2$Sold_in_Year))
  
  if (nrow(Max_year)>1){
    
    ID_list[j]<-Max_year[1,1]
    Max_Sold_Year[j]<-Max_year[1,2]
    Max_list[j]<-Max_year[1,3]
    j<-j+1
  }
  
  else {
    
    ID_list[j]<-Max_year[1,1]
    Max_Sold_Year[j]<-Max_year[1,2]
    Max_list[j]<-Max_year[1,3]
    j<-j+1
  }
}
#changing above list to dataframe

mm=length(ID_list)
df_test_list<- data.frame(seller_ID=numeric(mm), Max_sold_num_year=numeric(mm),Max_year=numeric(mm))

for (i in 1:mm){
  
  df_test_list$seller_ID[i] <- ID_list[[i]]
  df_test_list$Max_sold_num_year[i] <- Max_list[[i]]
  df_test_list$Max_year[i] <- Max_Sold_Year[[i]]
}

however, due to subsetting each time and using for loop this approach is kind of slow for a large dataset. Do you have any suggestions on how I can improve my code? is there any other way that I can calculate the desired result without using for loop?

Thanks

Upvotes: 1

Views: 44

Answers (1)

stefan
stefan

Reputation: 125687

Try this

library(dplyr)

seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
                1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)

data<- data.frame(seller_ID,Product_ID,year)

data %>% 
  dplyr::count(seller_ID, year) %>% 
  dplyr::group_by(seller_ID) %>% 
  dplyr::filter(n == max(n)) %>% 
  dplyr::rename(Max_sold_num_year = n, Max_year = year)
#> # A tibble: 4 x 3
#> # Groups:   seller_ID [4]
#>   seller_ID Max_year Max_sold_num_year
#>       <dbl>    <dbl>             <int>
#> 1         1     2015                 2
#> 2         2     2020                 2
#> 3         3     2000                 1
#> 4         4     2001                 3

And thanks to the comment by @yung_febreze this could be achieved even shorter with

data %>% 
  dplyr::count(seller_ID, year) %>% 
  dplyr::group_by(seller_ID) %>%
  dplyr::top_n(1)

EDIT In case of duplicated maximum values one can add dplyr::top_n(1, wt = year) which filters for the latest (or maximum) year:

data %>% 
  dplyr::count(seller_ID, year) %>% 
  dplyr::group_by(seller_ID) %>% 
  dplyr::top_n(1, wt = n) %>% 
  dplyr::top_n(1, wt = year) %>% 
  dplyr::rename(Max_sold_num_year = n, Max_year = year)

Upvotes: 1

Related Questions