Agustin

Reputation: 1526

What exactly does stat=identity mean in geom_bar ggplot?

I'm working through someone else's work and having trouble understanding the purpose of this line of code:

ggplot(data, aes(x = Group, y = Value)) + geom_bar(stat="identity", position = "dodge", lwd = 1,  aes(fill = Group))

Here's an example of the data

   Group  Value
1    A      20
2    B      74
3    B      50
...
n    A      24

I think the purpose of the code used was to plot a bar graph summarising the value of group A and of group B. However I believe a bar is plotted for every element of group A and of group B, and all these bars overlap and only the maximum value of each group is shown. Is this the case? I think the aim was to plot a summary statistic, either the mean or median rather than plot all and only being able to see the maximum.

If this is not what is happening I'd appreciate any help in understanding what the use of stat='identity' means as reading the documentation hasn't helped me much.

Thank you

Upvotes: 2

Views: 10215

Answers (1)

qiushi yan

Reputation: 53

By default, geom_bar() behaves like a discrete version of geom_histogram(). Without stat = "identity", it applies the statistical transformation stat_count() (stat_bin() in ggplot2 versions before 2.0.0), which counts the number of observations at each value of Group. The resulting count variable is mapped to the y-axis automatically and shown as the height of the bars. Supplying aes(y = ...) without stat = "identity" is therefore invalid, because geom_bar() computes the y variable itself instead of taking it from you. stat = "identity" is for when the height of the bars should be a value column in your data rather than a count of your x variable: geom_bar() then applies no transformation and uses the y variable you supply as is.
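To make the difference concrete, here is a minimal sketch. The four-row data frame is made up to match the shape of the data in the question (it is not the asker's real data), and the names p_count and p_identity are just illustrative.

```r
library(ggplot2)

# Hypothetical data shaped like the example in the question
data <- data.frame(
  Group = c("A", "B", "B", "A"),
  Value = c(20, 74, 50, 24)
)

# Default stat: bar height is the number of rows per Group,
# so no y aesthetic is supplied
p_count <- ggplot(data, aes(x = Group)) +
  geom_bar()

# stat = "identity": bar height is taken directly from the Value column
p_identity <- ggplot(data, aes(x = Group, y = Value)) +
  geom_bar(stat = "identity")
```

Note that with stat = "identity" one bar is drawn per row. With the default position = "stack" the bars of a group are stacked, so the visible height is the sum of Value. With position = "dodge" and a single fill per Group, as in the question, the bars of a group should all sit at the same position and overlap, so only the tallest one is visible, which matches what the asker suspected.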

If you want to plot a summary statistic for each Group, and that summary isn't the count of observations, you will have to compute it before plotting, e.g. data %>% group_by(Group) %>% summarise(summary = mean(Value)) (or median(Value) for the median). Then use stat = "identity" with aes(x = Group, y = summary), where summary is the column you want to display.
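A rough sketch of that workflow, assuming dplyr is available and again using made-up data in the shape of the question; the column name mean_value and the object names are placeholders of my own:

```r
library(dplyr)
library(ggplot2)

# Hypothetical data shaped like the example in the question
data <- data.frame(
  Group = c("A", "B", "B", "A"),
  Value = c(20, 74, 50, 24)
)

# One row per Group holding the summary statistic to display
summary_data <- data %>%
  group_by(Group) %>%
  summarise(mean_value = mean(Value))  # swap in median(Value) if preferred

# One bar per Group, height equal to the precomputed mean
p_mean <- ggplot(summary_data, aes(x = Group, y = mean_value, fill = Group)) +
  geom_bar(stat = "identity")
```

Because summary_data has exactly one row per Group, there is nothing left to stack or overlap. geom_col() is equivalent to geom_bar(stat = "identity") and could be used here instead.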

Upvotes: 4
