Reputation: 923
I have a dataset which includes seller_ID, product_ID and year the product was sold, and I am trying to find the year that one seller had maximum sold product and the specific number of sold in that year for each individual seller. Here is an example of data
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data<- data.frame(seller_ID,Product_ID,year)
seller_ID Product_ID year
1 1 1000 2015
2 1 1000 2016
3 1 1005 2015
4 2 1004 2020
5 2 1005 2020
6 3 1003 2000
7 4 1010 2000
8 4 1000 2001
9 4 1001 2001
10 4 1019 2001
11 4 1017 2005
so the ideal result would be:
seller_ID Max_sold_num_year Max_year
1 1 2 2015
2 2 2 2020
3 3 1 2000
4 4 3 2001
I tried the approach I explained below and it worked ...
df_temp<- data %>%
group_by(seller_ID, year) %>%
summarize(Sold_in_Year=length(Product_ID))
unique_seller=unique(data$seller_ID)
ID_list=c()
Max_list=c()
Max_Sold_Year=c()
j=1
for (ID in unique_seller) {
df_temp_2<- subset(df_temp, df_temp$seller_ID==ID)
Max_year<- subset(df_temp_2,df_temp_2$Sold_in_Year==max(df_temp_2$Sold_in_Year))
if (nrow(Max_year)>1){
ID_list[j]<-Max_year[1,1]
Max_Sold_Year[j]<-Max_year[1,2]
Max_list[j]<-Max_year[1,3]
j<-j+1
}
else {
ID_list[j]<-Max_year[1,1]
Max_Sold_Year[j]<-Max_year[1,2]
Max_list[j]<-Max_year[1,3]
j<-j+1
}
}
#changing above list to dataframe
mm=length(ID_list)
df_test_list<- data.frame(seller_ID=numeric(mm), Max_sold_num_year=numeric(mm),Max_year=numeric(mm))
for (i in 1:mm){
df_test_list$seller_ID[i] <- ID_list[[i]]
df_test_list$Max_sold_num_year[i] <- Max_list[[i]]
df_test_list$Max_year[i] <- Max_Sold_Year[[i]]
}
however, due to subsetting each time and using for loop this approach is kind of slow for a large dataset. Do you have any suggestions on how I can improve my code? is there any other way that I can calculate the desired result without using for loop?
Thanks
Upvotes: 1
Views: 44
Reputation: 125687
Try this
library(dplyr)
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data<- data.frame(seller_ID,Product_ID,year)
data %>%
dplyr::count(seller_ID, year) %>%
dplyr::group_by(seller_ID) %>%
dplyr::filter(n == max(n)) %>%
dplyr::rename(Max_sold_num_year = n, Max_year = year)
#> # A tibble: 4 x 3
#> # Groups: seller_ID [4]
#> seller_ID Max_year Max_sold_num_year
#> <dbl> <dbl> <int>
#> 1 1 2015 2
#> 2 2 2020 2
#> 3 3 2000 1
#> 4 4 2001 3
And thanks to the comment by @yung_febreze this could be achieved even shorter with
data %>%
dplyr::count(seller_ID, year) %>%
dplyr::group_by(seller_ID) %>%
dplyr::top_n(1)
EDIT In case of duplicated maximum values one can add dplyr::top_n(1, wt = year)
which filters for the latest (or maximum) year:
data %>%
dplyr::count(seller_ID, year) %>%
dplyr::group_by(seller_ID) %>%
dplyr::top_n(1, wt = n) %>%
dplyr::top_n(1, wt = year) %>%
dplyr::rename(Max_sold_num_year = n, Max_year = year)
Upvotes: 1