Reputation: 244
I have a large (rows > 200000) data frame with dozens of columns of data. I want to distill this data frame down and summarize the number of data that have variables that fall within given ranges.
For instance, if I have a data.frame
that is similar to this:
set.seed(10)
df <- data.frame( age = runif( n = 1000, min = 0, max = 4000 ),
size = rnorm( n = 1000, mean = 10, sd = 1 ),
shape = rnorm( n = 1000, mean = 1000, sd = 1000) )
and I would like to group get the number
of samples within a series of age ranges, the mean size
and shape
, and the median size
and shape
from the samples in each of those age brackets.
Something like
summary.df <- data.frame( age.group = seq( 0, 3900, by = 100 ),
number = (number of samples in age bin),
mean = ( mean of data in age bin ) )
etc.
Right now I am doing this very bluntly by creating a new data.frame
for each age group.
data.1 <- subset( df, age > 0 & age <= 100 )
data.2 <- subset( df, age > 100 & age <= 200 )
data.3 <- subset( df, age > 200 & age <= 300 )
etc. and then adding a categorical variable
data.1 <- data.frame( data.1, age.group = "100", count.row = nrow( data.1 ) )
data.2 <- data.frame( data.2, age.group = "200", count.row = nrow( data.2 ) )
data.3 <- data.frame( data.3, age.group = "300", count.row = nrow( data.3 ) )
adding them together
data.big <- rbind( data.1, data.2, data.3 )
and then generating summary stats via dplyr
data.summary <- data.big %>%
group_by( age.group ) %>%
summarize( count.row = mean( count.row ),
mean = mean( size, na.rm = TRUE ),
median = median( size, na.rm = T ) )
How would I go about doing this more efficiently with just dplyr
? I think there must be a way but I can't wrap my head around it.
Thanks for any help you can give!
Upvotes: 0
Views: 108
Reputation: 388982
You can make use of cut
to divide the data in intervals of 100 and calculate summary statistics for each group.
library(dplyr)
df %>%
group_by(age = cut(age, seq( 0, 4000, by = 100))) %>%
summarise(mean = mean( size, na.rm = TRUE),
median = median( size, na.rm = TRUE))
# age mean median
# <fct> <dbl> <dbl>
# 1 (0,100] 10.0 9.92
# 2 (100,200] 9.88 10.2
# 3 (200,300] 10.1 10.1
# 4 (300,400] 9.83 9.80
# 5 (400,500] 9.95 9.72
# 6 (500,600] 9.68 9.78
# 7 (600,700] 10.2 10.5
# 8 (700,800] 10.2 10.4
# 9 (800,900] 9.68 9.47
#10 (900,1e+03] 9.80 9.81
# … with 30 more rows
Upvotes: 1