Reputation: 52718
From this question we see a simple geom_line in the answer.
library(dplyr)
library(ggplot2)
library(lubridate)  # year() is assumed to come from lubridate
BactData %>% filter(year(Date) == 2017) %>%
  ggplot(aes(Date, Svartediket_CB)) + geom_line()
If we change geom_line to geom_bar, we might expect to see a bar plot, but instead we get an error:
Error: stat_count() must not be used with a y aesthetic.
But it works if we add stat = "identity", like so:
library(dplyr)
library(ggplot2)
library(lubridate)  # year() is assumed to come from lubridate
BactData %>% filter(year(Date) == 2017) %>%
  ggplot(aes(Date, Svartediket_CB)) + geom_bar(stat = "identity")
Why doesn't geom_bar work without stat = "identity", i.e. what is the purpose of stat = "identity"?
Upvotes: 38
Views: 136476
Reputation: 439
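@Stevec, I found the answer at rdocumentation.org.
This is what stat = "identity" means. The quoted passage comes from an older version of the documentation; current ggplot2 uses stat = "count" rather than stat = "bin" as the default, which is why the error in the question mentions stat_count().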
"The heights of the bars commonly represent one of two things: either a count of cases in each group, or the values in a column of the data frame. By default, geom_bar uses stat="bin". This makes the height of each bar equal to the number of cases in each group, and it is incompatible with mapping values to the y aesthetic. If you want the heights of the bars to represent values in the data, use stat="identity" and map a value to the y aesthetic."
Hope this was helpful.
Source: geom_bar documentation
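The two meanings of bar height described in the quote can be sketched with toy data. This is only an illustration: the data frames `raw` and `summarised` and their columns are hypothetical and are not the BactData from the question.

```r
library(ggplot2)

# Hypothetical raw data: one row per observation
raw <- data.frame(site = c("A", "A", "A", "B", "B"))

# Hypothetical pre-computed data: one row per site with a value to plot
summarised <- data.frame(site = c("A", "B"), value = c(12.5, 40))

# Meaning 1: bar height is the number of rows in each group (default stat)
p_count <- ggplot(raw, aes(site)) + geom_bar()

# Meaning 2: bar height is taken from a column, so y is mapped
# and stat = "identity" is set
p_value <- ggplot(summarised, aes(site, value)) + geom_bar(stat = "identity")

# layer_data() returns the layer's data after the stat has run
count_heights <- layer_data(p_count)$y
value_heights <- layer_data(p_value)$y
```

Here `count_heights` holds the row counts per site, while `value_heights` simply repeats the `value` column.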
Upvotes: 8
Reputation: 1831
There are two layers that are closely related: geom_bar() and geom_col(). The key difference is how they aggregate the data by default.
For geom_bar(), the default behavior is to count the rows for each x value. It doesn't expect a y value, since it's going to count that up itself; in fact, it will throw an error if you give it one, since it assumes you're confused. How the aggregation is performed is controlled by the stat argument of geom_bar(), which defaults to stat = "count".
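A small sketch of where that failure happens, assuming a hypothetical data frame `toy`. Creating the plot object works; the stat only runs when the plot is built or printed, so that is where the error shows up. The exact wording of the message differs between ggplot2 versions, so the sketch only records whether an error occurred.

```r
library(ggplot2)

# Hypothetical data with one pre-computed value per site
toy <- data.frame(site = c("A", "B"), value = c(12.5, 40))

# Helper (hypothetical name): try to build a plot and report the outcome
try_build <- function(p) {
  tryCatch(
    {
      ggplot_build(p)
      "built"
    },
    error = function(e) "error"
  )
}

# y is mapped, but the default stat still wants to count rows
p_default <- ggplot(toy, aes(site, value)) + geom_bar()

# Same mapping, but the stat is told to leave the data alone
p_identity <- ggplot(toy, aes(site, value)) + geom_bar(stat = "identity")

default_result  <- try_build(p_default)
identity_result <- try_build(p_identity)
```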
If you explicitly say stat = "identity" in geom_bar(), you're telling ggplot2 to skip the aggregation and that you'll provide the y values yourself. This mirrors the natural behavior of geom_col(), described below.
In the case of geom_col(), it won't try to aggregate the data by default. From the docs, "geom_col() uses stat_identity(): it leaves the data as is". So it expects you to have the y values already calculated and uses them directly. geom_col() also has no argument to change that behavior: it always plots the y values you provide, and you need to provide them.
If you have y values, you could use either syntax, but I find geom_col() more direct.
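To check that the two spellings really give the same bars, one can compare the data each layer ends up with. This is a minimal sketch with a hypothetical data frame `toy`, not the data from the question.

```r
library(ggplot2)

# Hypothetical data: y values are already calculated, one per site
toy <- data.frame(site = c("A", "B", "C"), value = c(3, 7, 5))

p_bar <- ggplot(toy, aes(site, value)) + geom_bar(stat = "identity")
p_col <- ggplot(toy, aes(site, value)) + geom_col()

# Bar heights as computed for each layer
bar_y <- layer_data(p_bar)$y
col_y <- layer_data(p_col)$y

# TRUE when both layers draw bars of the same height
same_heights <- isTRUE(all.equal(bar_y, col_y))
```

In both cases the heights are just the `value` column, untouched.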
Upvotes: 53
Reputation: 1378
By default, geom_bar() uses stat_count() to plot the frequency of cases at each level of x (some grouping variable). This can be overridden with stat_identity(), by supplying the argument stat = "identity", to plot the value at each level of x instead. The reason is that geom_bar() is intended to plot frequencies; a single value per level could be represented more efficiently by a single point.
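The override can be seen by looking at the stat attached to each layer. This sketch relies on the layer object exposing its stat as `$stat`, which is an implementation detail of ggplot2 rather than a documented interface, so treat it as an assumption that may not hold in every version.

```r
library(ggplot2)

# Layers can be created on their own, without a plot or data
layer_default  <- geom_bar()
layer_identity <- geom_bar(stat = "identity")

# Check which stat each layer carries
default_is_count     <- inherits(layer_default$stat, "StatCount")
identity_is_identity <- inherits(layer_identity$stat, "StatIdentity")
```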
Upvotes: 6