Reputation: 1393
The source of this data is server performance metrics. The numbers I have are the mean (os_cpu) and standard deviation (os_cpu_sd). The mean clearly doesn't tell the whole story, so I want to add the standard deviation. I started down the path of geom_errorbar, but I believe that is intended for standard error. What would be an accepted way to plot these metrics? Below is a reproducible example:
DF_CPU <- structure(list(end = structure(c(1387315140, 1387316340, 1387317540,
1387318740, 1387319940, 1387321140, 1387322340, 1387323540, 1387324740,
1387325940, 1387327140, 1387328340, 1387329540, 1387330740, 1387331940,
1387333140, 1387334340, 1387335540, 1387336740, 1387337940, 1387339140,
1387340340, 1387341540, 1387342740, 1387343940, 1387345140, 1387346340,
1387347540, 1387348740, 1387349940), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), os_cpu = c(14.8, 15.5, 17.4, 15.6, 14.9, 14.6,
15, 15.2, 14.6, 15.2, 15, 14.5, 14.8, 15, 14.6, 14.9, 14.9, 14.4,
14.8, 14.9, 14.5, 15, 14.6, 14.5, 15.3, 14.6, 14.6, 15.2, 14.5,
14.5), os_cpu_sd = c(1.3, 2.1, 3.2, 3.3, 0.9, 0.4, 1.4, 1.5,
0.4, 1.6, 1, 0.4, 1.4, 1.4, 0.4, 1.3, 0.9, 0.4, 1.4, 1.3, 0.4,
1.7, 0.4, 0.4, 1.7, 0.4, 0.4, 1.7, 0.5, 0.4)), .Names = c("end",
"os_cpu", "os_cpu_sd"), class = "data.frame", row.names = c(1L,
5L, 9L, 13L, 17L, 21L, 25L, 29L, 33L, 37L, 41L, 45L, 49L, 53L,
57L, 61L, 65L, 69L, 73L, 77L, 81L, 85L, 89L, 93L, 97L, 101L,
105L, 109L, 113L, 117L))
head(DF_CPU)
end os_cpu os_cpu_sd
1 2013-12-17 21:19:00 14.8 1.3
5 2013-12-17 21:39:00 15.5 2.1
9 2013-12-17 21:59:00 17.4 3.2
13 2013-12-17 22:19:00 15.6 3.3
17 2013-12-17 22:39:00 14.9 0.9
ggplot(data = DF_CPU, aes(x = end, y = os_cpu)) +
  geom_line() +
  geom_errorbar(aes(ymin = os_cpu - os_cpu_sd, ymax = os_cpu + os_cpu_sd),
                alpha = 0.2, color = "red")
Per @ari-b-friedman's suggestion, here's what it looks like with geom_ribbon():
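A minimal sketch of that geom_ribbon() variant (using a few rows of the data above; the fill colour and alpha are illustrative choices, not from the original post):

```r
library(ggplot2)

# A few rows of the question's data, enough to draw the ribbon
DF_CPU <- data.frame(
  end       = as.POSIXct(c(1387315140, 1387316340, 1387317540, 1387318740),
                         origin = "1970-01-01", tz = "UTC"),
  os_cpu    = c(14.8, 15.5, 17.4, 15.6),
  os_cpu_sd = c(1.3, 2.1, 3.2, 3.3)
)

# Shaded band of mean +/- one SD, with the mean drawn as a line on top
p <- ggplot(DF_CPU, aes(x = end, y = os_cpu)) +
  geom_ribbon(aes(ymin = os_cpu - os_cpu_sd, ymax = os_cpu + os_cpu_sd),
              fill = "red", alpha = 0.2) +
  geom_line()
```

The ribbon reads as a continuous envelope rather than discrete error bars, which suits regularly sampled time-series data.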
Upvotes: 2
Views: 1492
Reputation: 59395
Your question is largely about aesthetics, and so opinions will differ. Having said that, there are some guidelines. So this code:
ggplot(data = DF_CPU, aes(x = end, y = os_cpu)) +
  geom_point(size = 3, shape = 1) +
  geom_line(linetype = 2, colour = "grey") +
  geom_linerange(aes(ymin = os_cpu - 1.96 * os_cpu_sd,
                     ymax = os_cpu + 1.96 * os_cpu_sd),
                 alpha = 0.5, color = "blue") +
  ylim(0, max(DF_CPU$os_cpu + 1.96 * DF_CPU$os_cpu_sd)) +
  stat_smooth(formula = y ~ 1, se = TRUE, method = "lm", linetype = 2, size = 1) +
  theme_bw()
Produces this:
This graphic emphasizes that CPU utilization (??) over 20-minute intervals did not deviate significantly from the average for the 9-hour period monitored. The reference line is average utilization. The error bars were replaced with geom_linerange(...) because the horizontal bars in geom_errorbar(...) add nothing and are distracting. Also, your original plot makes it seem that the error is very large compared to actual utilization, which it isn't. I changed the range to +/- 1.96*sd because that more closely approximates a 95% CL. Finally, the x- and y-axis labels need to be replaced with something descriptive, but I don't have enough information to do that.
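For the axis labels, labs() is the usual mechanism; a minimal sketch on toy data (the label wording here is hypothetical, since the answer notes the real labels are unknown):

```r
library(ggplot2)

# Toy data standing in for the real series
toy <- data.frame(x = 1:4, y = c(14.8, 15.5, 17.4, 15.6))

# labs() sets descriptive axis titles; these strings are placeholders
p <- ggplot(toy, aes(x, y)) +
  geom_line() +
  labs(x = "Time (20-minute intervals, UTC)",
       y = "CPU utilization (%)")
```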
Upvotes: 4
Reputation: 94237
There's a designer's adage that "form follows function", and this should apply to graphics. What are you trying to do with your plots? What's the question you are trying to answer?
If it is "is CPU usage significantly decreasing with time?" then this plot will probably do, and it gives the answer "no". If it is "is the probability of exceeding 10s changing with time?" then you need to assume a model for your data (e.g. something as simple as Normal(os_cpu, os_cpu_sd)) and then plot exceedance (tail) probabilities.
Anyway, just plotting means and envelopes like you have done is always a fair start, and at least answers the questions "what does my data look like?" and "is anything obviously wrong?"
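A minimal sketch of that exceedance-probability idea, assuming the Normal(os_cpu, os_cpu_sd) model suggested above (the threshold of 16 is an illustrative choice, not from the original post):

```r
# Per-interval P(CPU > threshold) under a Normal(mean, sd) model
os_cpu    <- c(14.8, 15.5, 17.4, 15.6, 14.9)
os_cpu_sd <- c(1.3, 2.1, 3.2, 3.3, 0.9)
threshold <- 16  # illustrative threshold, not from the question

# Upper-tail probability for each interval; pnorm is vectorized over mean/sd
p_exceed <- pnorm(threshold, mean = os_cpu, sd = os_cpu_sd, lower.tail = FALSE)
```

Plotting p_exceed against time would then answer the "is the probability changing?" version of the question directly.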
Upvotes: 2