watchtower
watchtower

Reputation: 4298

Descriptive statistics on grouped and ungrouped observations

I have recently migrated to R from Stata. I am unsure how to perform compute descriptive statistics on grouped and ungrouped observations.

Here's my data:

dput(DF)
structure(list(Product_Name = c("iPhone", "iPhone", "iPhone", 
"iPhone", "iPhone", "iPhone", "Nexus 6P", "Nexus 6P", "Nexus 6P", 
"Nexus 6P", "Nexus 6P", "Nexus 6P"), Product_Type = c("New", 
"New", "Refurbished", "New", "New", "Refurbished", "Refurbished", 
"Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished"
), Year = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013, 
2015, 2009, 2010), Units = c(100, 200, 300, 400, 500, 600, 700, 
200, 120, 125, 345, 340)), .Names = c("Product_Name", "Product_Type", 
"Year", "Units"), row.names = c(NA, 12L), class = "data.frame")

My data has products sold by year and type. Each product could a refurbished product or a new product. Further, if it was sold before 2010, I would mark it as sold in "Time 1" otherwise I would mark it as sold in "Time 2".

Here's my code for this:

DF[DF$Year<2010,"Time"]<-"1"
DF[DF$Year>=2010,"Time"]<-"2"

Now, I want to generate descriptive statistics for these time periods.

DF %>% 
  group_by(Product_Name, Product_Type,Time) %>%
  dplyr::summarise(Count = n(), 
                   Sum_Units = sum(Units,na.rm=TRUE), 
                   Avg_Units = mean(Units,na.rm = TRUE), 
                   Max_Units=max(Units,na.rm = TRUE))

If we run above code, we would get descriptive statistics by Product_Name, Product_Type, and Time (i.e. grouped descriptive statistics). However, this is not what I want. I want descriptive statistics with and without considering groupings with Product_Type and Time. Meaning, I would want to compute descriptive statistics assuming that products were sold in Time 1 OR Time 2 (i.e. all years) and irrespective of the type of product sold, while retaining some of the grouped information above.

Expected output:

dput(DFOut)
structure(list(Product_Name = c("iPhone", "Nexus 6P"), New_Units_Sum_Time1 = c(500, 
NA), Refurbished_Units_Sum_Time_1 = c(900, 545), Sum_Units_Time1 = c(1400, 
545), Sum_Units_Time2 = c(700, 1285), Sum_Units_Time_1_And_2 = c(2100, 
1830), Avg_Units_Time1 = c(350, 272.5), Avg_Units_Time2 = c(350, 
321.25), Avg_Units_Time_1_And_2 = c(350, 305), Max_Units_Time1 = c(600, 
345), Max_Units_Time2 = c(500, 700), Max_Units_Time_1_And_2 = c(600, 
700)), .Names = c("Product_Name", "New_Units_Sum_Time1", "Refurbished_Units_Sum_Time_1", 
"Sum_Units_Time1", "Sum_Units_Time2", "Sum_Units_Time_1_And_2", 
"Avg_Units_Time1", "Avg_Units_Time2", "Avg_Units_Time_1_And_2", 
"Max_Units_Time1", "Max_Units_Time2", "Max_Units_Time_1_And_2"
), row.names = 1:2, class = "data.frame")

In the output, you would see that I have some descriptive statistics:

a) based on the type of product and the period it was sold (e.g. New_Units_Sum_Time1 i.e. New and Time1). Please note that in the output, I have only shown the combination of New and Time1. If you can guide me how to produce descriptive statistics for other combinations of Refurbished and Time, it would be awesome.

b) based on ignoring type of product but not ignoring the period sold (e.g. Sum_Units_Time1 i.e. Time1)

c) based on ignoring both type of product and period it was sold (e.g. Sum_Units_Time_1_And_2).

Ditto for Avg and Mean.

How can I do this? I'd appreciate any help. I am really struggling on this.


Please note that I manually created DFOut using Excel. Although I triple checked it, there could be some manual errors--I will be more than happy to clarify them if there are questions. Thank you for your time.


sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grDevices datasets  stats     graphics  grid      tcltk     utils     methods   base     

other attached packages:
 [1] tables_0.8              Hmisc_4.0-2             Formula_1.2-1           survival_2.40-1        
 [5] ResourceSelection_0.3-0 magrittr_1.5            stringr_1.1.0           bit64_0.9-5            
 [9] bit_1.1-12              tufterhandout_1.2.1     knitr_1.15.1            rmarkdown_1.3          
[13] tufte_0.2               corrplot_0.77           purrr_0.2.2             readr_1.0.0            
[17] tibble_1.2              tidyverse_1.1.1         cowplot_0.7.0           plotly_4.5.6           
[21] ggplot2_2.2.1           maps_3.1.1              directlabels_2015.12.16 tidyr_0.6.1            
[25] ggthemes_3.3.0          R2HTML_2.3.2            lubridate_1.6.0         xts_0.9-7              
[29] zoo_1.7-14              lattice_0.20-34         corrgram_1.10           hexbin_1.27.1          
[33] sm_2.2-5.4              compare_0.2-6           installr_0.18.0         psych_1.6.12           
[37] reshape2_1.4.2          readstata13_0.8.5       pastecs_1.3-18          boot_1.3-18            
[41] vcd_1.4-3               car_2.1-4               xlsxjars_0.6.1          rJava_0.9-8            
[45] debug_1.3.1             dplyr_0.5.0             foreign_0.8-67          gmodels_2.16.2         
[49] openxlsx_4.0.0          plyr_1.8.4             

loaded via a namespace (and not attached):
 [1] minqa_1.2.4         colorspace_1.3-2    class_7.3-14        modeltools_0.2-21   mclust_5.2.2       
 [6] rprojroot_1.2       htmlTable_1.9       base64enc_0.1-3     MatrixModels_0.4-1  flexmix_2.3-13     
[11] mvtnorm_1.0-5       xml2_1.1.1          codetools_0.2-15    splines_3.3.2       mnormt_1.5-5       
[16] robustbase_0.92-7   jsonlite_1.2        nloptr_1.0.4        pbkrtest_0.4-6      broom_0.4.1        
[21] cluster_2.0.5       kernlab_0.9-25      httr_1.2.1          backports_1.0.5     assertthat_0.1     
[26] Matrix_1.2-7.1      lazyeval_0.2.0      acepack_1.4.1       htmltools_0.3.5     quantreg_5.29      
[31] tools_3.3.2         gtable_0.2.0        Rcpp_0.12.9         trimcluster_0.1-2   gdata_2.17.0       
[36] nlme_3.1-128        iterators_1.0.8     fpc_2.1-10          lmtest_0.9-34       lme4_1.1-12        
[41] rvest_0.3.2         gtools_3.5.0        dendextend_1.4.0    DEoptimR_1.0-8      MASS_7.3-45        
[46] scales_0.4.1        TSP_1.1-4           hms_0.3             parallel_3.3.2      SparseM_1.74       
[51] RColorBrewer_1.1-2  gridExtra_2.2.1     rpart_4.1-10        latticeExtra_0.6-28 stringi_1.1.2      
[56] gclus_1.3.1         mvbutils_2.7.4.1    foreach_1.4.3       checkmate_1.8.2     seriation_1.2-1    
[61] caTools_1.17.1      prabclus_2.2-6      bitops_1.0-6        evaluate_0.10       htmlwidgets_0.8    
[66] R6_2.2.0            gplots_3.0.1        DBI_0.5-1           haven_1.0.0         whisker_0.3-2      
[71] mgcv_1.8-16         nnet_7.3-12         modelr_0.1.0        KernSmooth_2.23-15  viridis_0.3.4      
[76] readxl_0.1.1        data.table_1.10.0   forcats_0.2.0       digest_0.6.12       diptest_0.75-7     
[81] stats4_3.3.2        munsell_0.4.3       registry_0.3        viridisLite_0.1.3   quadprog_1.5-5    

Upvotes: 0

Views: 172

Answers (1)

Sotos
Sotos

Reputation: 51592

One way to automate this is to first create a vector (ind) with all possible combinations of your grouping variables. We then take those combinations and convert them to formula with Units. As each formula is saved in a list (l1), we iterate over that list and aggregate.

ind <- unlist(sapply(c(2,3), function(i) combn(c('Product_Name', 'Product_Type', 'Time'), 
                                                             i, paste, collapse = '+')))

l1 <- sapply(ind, function(i) as.formula(paste('Units ~ ', i)))

lapply(l1, function(i) aggregate(i, df, FUN = function(j) c(sum1 = sum(j), 
                                                            avg = mean(j), 
                                                            max_units = max(j))))

#which gives

#$`Product_Name+Product_Type`
#  Product_Name Product_Type Units.sum1 Units.avg Units.max_units
#1       iPhone          New       1200       300             500
#2       iPhone  Refurbished        900       450             600
#3     Nexus 6P  Refurbished       1830       305             700

#$`Product_Name+Time`
#  Product_Name Time Units.sum1 Units.avg Units.max_units
#1       iPhone    1    1400.00    350.00          600.00
#2     Nexus 6P    1     545.00    272.50          345.00
#3       iPhone    2     700.00    350.00          500.00
#4     Nexus 6P    2    1285.00    321.25          700.00

#$`Product_Type+Time`
#  Product_Type Time Units.sum1 Units.avg Units.max_units
#1          New    1     500.00    250.00          400.00
#2  Refurbished    1    1445.00    361.25          600.00
#3          New    2     700.00    350.00          500.00
#4  Refurbished    2    1285.00    321.25          700.00

#$`Product_Name+Product_Type+Time`
#  Product_Name Product_Type Time Units.sum1 Units.avg Units.max_units
#1       iPhone          New    1     500.00    250.00          400.00
#2       iPhone  Refurbished    1     900.00    450.00          600.00
#3     Nexus 6P  Refurbished    1     545.00    272.50          345.00
#4       iPhone          New    2     700.00    350.00          500.00
#5     Nexus 6P  Refurbished    2    1285.00    321.25          700.00

Upvotes: 2

Related Questions