Efficient way to summarise, re-group and plot data set of group frequencies in R

Question

I have a set of data for three groups (A, B, C) which gives information on how often a certain value "x" (between -3 and +3) is observed for that group (0 to 100). To give a simplified example:

df <- data.frame(x = seq(-3, 3, 1),
                 A = c(0, 10, 25, 30, 15, 0, 0),
                 B = c(25, 30, 24, 29, 2, 15, 0),
                 C = c(0, 0, 5, 10, 20, 30, 30))

The actual data set is quite big, however, so there is a large number of very detailed x values (at least two decimals) for which each group has associated frequencies, which often drop to near-zero for certain x values. When plotting this using the command below, the result looks rather convoluted.

library(reshape2)
library(ggplot2)

df <- melt(df, id = "x")
ggplot(df, aes(x = x, y = value, color = variable)) + geom_line()

What would be the best way to calculate summary statistics for each group? (mean, median, ...)
What would be the most effective way to aggregate x values and their neighbours into ranges of x values, summing up the associated group frequencies in the process, to get a more generalised picture?
How would one tell ggplot to produce a histogram or density plot which accounts for the observed frequencies, so that one would get a plot which looks like this?

I thought of iterating over the data set and doing all of the above "manually", but figured that this would be inefficient and prone to errors. Any suggestions you may have would be greatly appreciated!

AntoniosK · Accepted Answer

To create a histogram you need to expand the data: drop the "value" variable and instead repeat each "x" row as many times as its value says. So, if for group A you have x = 3 and value = 10, the process has to create the row x = 3 for group A 10 times. Run the process step by step to see how it works. I've included decimals for "x".

library(reshape2)
library(dplyr)
library(ggplot2)

set.seed(22)
df <- data.frame(x = seq(-3, 3, 0.01),
                 A = round(c(rnorm(200, 30,3),rnorm(401,20,4))),
                 B = round(c(rexp(300, 1/5), rexp(301,1/20))),
                 C = round(runif(601, 2, 25)))

df <- melt(df, id = "x")


# create number of rows for each x and group based on the value
df2 <- df %>%
  rowwise() %>%
  do(data.frame(x = rep(.$x, .$value),
                variable = rep(.$variable, .$value))) %>%
  ungroup()
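
The same expansion can also be sketched in base R by repeating row indices. This is only an illustrative alternative on a small made-up data frame (the name `toy` and its values are hypothetical), not a replacement for the pipeline above:

```r
# Small illustrative data in long format (hypothetical values)
toy <- data.frame(x = c(-1, 0, 1, -1, 0, 1),
                  variable = factor(c("A", "A", "A", "B", "B", "B")),
                  value = c(2, 0, 3, 1, 4, 0))

# Repeat each row index `value` times, then keep only x and the group
idx <- rep(seq_len(nrow(toy)), times = toy$value)
toy_expanded <- toy[idx, c("x", "variable")]
rownames(toy_expanded) <- NULL

nrow(toy_expanded)            # equals sum(toy$value)
table(toy_expanded$variable)  # number of expanded rows per group
```

Rows with value = 0 simply disappear, which is what a histogram of observed values needs.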


# check mean and median x values for each group
df2 %>% 
  group_by(variable) %>% 
  summarise(N = n(),
            MEAN_X= mean(x),
            MEDIAN_X= median(x))

#   variable     N      MEAN_X MEDIAN_X
# 1        A 13979 -0.27480292    -0.47
# 2        B  7051  0.84527159     1.03
# 3        C  7906 -0.03190741    -0.07
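
If the expanded data set gets too large, the same statistics can probably be computed straight from the frequencies. The sketch below uses a small made-up data frame, and `weighted_median` is a hypothetical helper written for this example. It returns the smallest x whose cumulative frequency reaches half the total, which can differ slightly from `median()` on the expanded data when the two middle observations have different x values.

```r
# Hypothetical long-format data: x, group and observed frequency
toy <- data.frame(x = rep(c(-1, 0, 1, 2), 2),
                  variable = rep(c("A", "B"), each = 4),
                  value = c(1, 2, 3, 4,
                            4, 3, 2, 1))

# Smallest x whose cumulative frequency reaches half of the total frequency
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  x[which(cumsum(w) >= sum(w) / 2)[1]]
}

# One column per group, one row per statistic
stats <- sapply(split(toy, toy$variable), function(g) {
  c(mean = weighted.mean(g$x, w = g$value),
    median = weighted_median(g$x, g$value))
})
stats
```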



ggplot(df2, aes(x=x, fill=variable)) +
  geom_histogram(binwidth=.2, alpha=.5, position="dodge")

ggplot(df2, aes(x=x, colour=variable)) + 
  geom_density()
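
As a possible shortcut, ggplot2's histogram accepts a `weight` aesthetic, so the frequencies can be used without expanding the rows first. A minimal sketch on made-up data (the `toy` values are hypothetical):

```r
library(ggplot2)

# Hypothetical long-format data: one row per x and group, with a frequency
toy <- data.frame(x = rep(c(-1, 0, 1), 2),
                  variable = rep(c("A", "B"), each = 3),
                  value = c(2, 5, 3, 4, 1, 0))

# weight = value makes each bar the sum of frequencies in its bin
p <- ggplot(toy, aes(x = x, weight = value, fill = variable)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "dodge")

# Inspect the computed bars; print(p) draws the plot itself
built <- layer_data(p)
sum(built$count)   # equals sum(toy$value)
```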

If you want to group x for each group in terms of the frequencies you can use a regression tree method that will split x into bins and will give you the break-point(s):

library(party)

# tree for group A only
model = ctree(value~x+variable, data = df[df$variable=="A",])

plot(model, type = "simple")

This tells you that for group A there is a break point at x = -1.01 (you can also see it in the histograms), which splits x into two groups. The left side averages value = 29.8 and the right side averages value = 19.99. The numbers of observations in the two bins are 200 and 401 respectively, which matches how I generated this variable at the beginning (200 draws with mean 30 followed by 401 draws with mean 20).

Note that trees are statistical models that split your variable(s) based on statistically significant differences (or other metrics). You can't force a particular grouping yourself. If you want to do that, it's better to split your variable "x" into N groups (based on quantiles, perhaps, or something else that makes more sense to you) and see how the value changes within those groups.
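
One way to do such manual grouping is `cut()` followed by a sum per group and range. The sketch below reuses the A and B columns from the question's small example. The break points are an arbitrary choice made for illustration and would need adapting to the real data.

```r
# Groups A and B from the question's example, already in long format
toy <- data.frame(x = rep(c(-3, -2, -1, 0, 1, 2, 3), 2),
                  variable = rep(c("A", "B"), each = 7),
                  value = c(0, 10, 25, 30, 15, 0, 0,
                            25, 30, 24, 29, 2, 15, 0))

# Three ranges over [-3, 3]: [-3,-1], (-1,1], (1,3] (arbitrary break points)
toy$x_bin <- cut(toy$x, breaks = c(-3, -1, 1, 3), include.lowest = TRUE)

# Sum the frequencies within each range and group
binned <- aggregate(value ~ variable + x_bin, data = toy, FUN = sum)
binned
```

The resulting `binned` table has one row per group and range, so it can be plotted with far fewer points than the raw data.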
