Distilling summary statistics by numerical categories with dplyr

Question

I have a large (rows > 200000) data frame with dozens of columns of data. I want to distill this data frame down and summarize the number of data that have variables that fall within given ranges.

For instance, if I have a data.frame that is similar to this:

 set.seed(10)
 df <- data.frame( age = runif( n = 1000, min = 0, max = 4000 ),
                   size = rnorm( n = 1000, mean = 10, sd = 1 ),
                   shape = rnorm( n = 1000, mean = 1000, sd = 1000) )

and I would like to group get the number of samples within a series of age ranges, the mean size and shape, and the median size and shape from the samples in each of those age brackets.

Something like

 summary.df <- data.frame( age.group = seq( 0, 3900, by = 100 ),
                           number = (number of samples in age bin),
                           mean = ( mean of data in age bin ) )

etc.

Right now I am doing this very bluntly by creating a new data.frame for each age group.

data.1          <- subset( df, age > 0 & age <= 100 )
data.2          <- subset( df, age > 100 & age <= 200 )
data.3          <- subset( df, age > 200 & age <= 300 )

etc. and then adding a categorical variable

data.1 <- data.frame( data.1, age.group = "100", count.row = nrow( data.1 ) )
data.2 <- data.frame( data.2, age.group = "200", count.row = nrow( data.2 ) )
data.3 <- data.frame( data.3, age.group = "300", count.row = nrow( data.3 ) )

adding them together

data.big <- rbind( data.1, data.2, data.3 )

and then generating summary stats via dplyr

data.summary <- data.big %>%
   group_by( age.group ) %>%
   summarize( count.row = mean( count.row ),
         mean = mean( size, na.rm = TRUE ),
         median = median( size, na.rm = T ) )

How would I go about doing this more efficiently with just dplyr? I think there must be a way but I can't wrap my head around it.

Thanks for any help you can give!

Ronak Shah · Accepted Answer

You can make use of cut to divide the data in intervals of 100 and calculate summary statistics for each group.

library(dplyr)

df %>%
  group_by(age = cut(age, seq( 0, 4000, by = 100))) %>%
  summarise(mean = mean( size, na.rm = TRUE),
            median = median( size, na.rm = TRUE))

#   age          mean median
#            
# 1 (0,100]     10.0    9.92
# 2 (100,200]    9.88  10.2 
# 3 (200,300]   10.1   10.1 
# 4 (300,400]    9.83   9.80
# 5 (400,500]    9.95   9.72
# 6 (500,600]    9.68   9.78
# 7 (600,700]   10.2   10.5 
# 8 (700,800]   10.2   10.4 
# 9 (800,900]    9.68   9.47
#10 (900,1e+03]  9.80   9.81
# … with 30 more rows

Distilling summary statistics by numerical categories with dplyr

Answers (1)

Related Questions