Reputation: 199
I have an R script that generates plots from the run time data of a simulation. However, runs sometimes fail, which produces null run time values and plots that make the run time look smaller than it really was.
Here's an example of what the data in the "data" data frame might look like:
| Version | TotalMean | TestNum | Case |
|:-------:|:---------:|:-------:|:-----:|
| 1.0.1 | 350 | 1 | Case1 |
| 1.0.2 | 430 | 2 | Case1 |
| 1.0.4 | 470 | 3 | Case1 |
| 1.0.7 | 445 | 4 | Case1 |
| 1.0.1 | 320 | 1 | Case2 |
| 1.0.2 | 280 | 2 | Case2 |
| 1.0.4 | 450 | 3 | Case2 |
| 1.0.7 | 420 | 4 | Case2 |
| 1.0.1 | 335 | 1 | Case3 |
| 1.0.2 | 415 | 2 | Case3 |
| 1.0.4 | 465 | 3 | Case3 |
| 1.0.7 | 430 | 4 | Case3 |
| 1.0.1 | 310 | 1 | Case4 |
| 1.0.2 | 375 | 2 | Case4 |
| 1.0.4 | 425 | 3 | Case4 |
| 1.0.7 | 410 | 4 | Case4 |
Note that there are no null values in that table, because the way the `TotalMean` column is calculated never reflects them. However, there are nulls in the data frame that `TotalMean` is calculated from. Is there a way to make `geom_point` depend on whether there are null values in a certain table, for example by changing the shape and size?
Use the code below to create a working example. Version 1.0.2 in Case2 has an anomalous value because it had null values in the original table.
library(ggplot2)
Version <- c("1.0.1","1.0.2","1.0.4","1.0.7","1.0.1","1.0.2","1.0.4","1.0.7","1.0.1","1.0.2","1.0.4","1.0.7","1.0.1","1.0.2","1.0.4","1.0.7")
TotalMean <- c(350,430,470,445,320,280,450,420,335,415,465,430,310,375,425,410)
TestNum <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Case <- c("Case1","Case1","Case1","Case1","Case2","Case2","Case2","Case2","Case3","Case3","Case3","Case3","Case4","Case4","Case4","Case4")
data <- data.frame(Version,TotalMean,TestNum,Case)
versions <- unique(data[order(data$TestNum), ][,1])
data$Version <- factor(data$Version, levels = versions)
Here's the ggplot2 code I use to create the chart:
g <- ggplot(data, aes(color = Case, x = Version, y = TotalMean, group = Case)) +
  geom_line() + geom_point(shape = 16, size = 2) + coord_cartesian(ylim = c(0, 550)) +
  labs(x = "Version", y = "Run Time (minutes)") +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(fun = sum, geom = "line") +
  theme(plot.title = element_text(face = "bold", size = 16, vjust = 1.5)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(axis.title.y = element_text(vjust = 1))
g
Upvotes: 1
Views: 1692
Reputation: 28825
I made a data frame (structure provided at the bottom) that looks like this:
# Version First_Run Second_Run TestNum Case
# 1 1.0.1 350 350 1 Case1
# 2 1.0.2 430 430 2 Case1
# 3 1.0.4 470 470 3 Case1
# 4 1.0.7 445 445 4 Case1
# 5 1.0.1 320 320 1 Case2
# 6 1.0.2 560 NA 2 Case2
# 7 1.0.4 450 450 3 Case2
# 8 1.0.7 420 420 4 Case2
# 9 1.0.1 335 335 1 Case3
# 10 1.0.2 415 415 2 Case3
# 11 1.0.4 465 465 3 Case3
# 12 1.0.7 430 430 4 Case3
# 13 1.0.1 310 310 1 Case4
# 14 1.0.2 375 375 2 Case4
# 15 1.0.4 425 425 3 Case4
# 16 1.0.7 410 410 4 Case4
Then I calculated the mean and a column for shape:
data$TotalMean <- rowMeans(subset(data, select = c(First_Run, Second_Run)), na.rm = TRUE)
data$shapeflag <- ifelse(is.na(data$First_Run * data$Second_Run), "b", "a")
Note: `na.rm = TRUE` omits `NA` values when computing the mean, so you can use it in your own calculation to correct the mean while still keeping the `shapeflag` column to identify the runs that returned null. You can see that it returned 560 for the sixth row instead of 280.
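With more than two run columns, the flag can be derived without multiplying the columns together. This is a minimal sketch, assuming a small made-up data frame `runs` with a hypothetical third column `Third_Run`; it is not the data shown above.

```r
# Hypothetical example data: three runs per row, one missing value in row 2
runs <- data.frame(
  First_Run  = c(350, 560, 450),
  Second_Run = c(350, NA, 450),
  Third_Run  = c(350, 560, 450)
)
run_cols <- c("First_Run", "Second_Run", "Third_Run")

# Mean over the available runs only; NA values are dropped row by row
runs$TotalMean <- rowMeans(runs[run_cols], na.rm = TRUE)

# "b" when any run in the row is NA, "a" otherwise
runs$shapeflag <- ifelse(complete.cases(runs[run_cols]), "a", "b")

runs
```

`complete.cases()` returns `TRUE` only for rows with no `NA` in the selected columns, so the flag keeps working if more run columns are added to `run_cols`.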
This is how the final dataset looks:
# Version First_Run Second_Run TestNum Case TotalMean shapeflag
# 1 1.0.1 350 350 1 Case1 350 a
# 2 1.0.2 430 430 2 Case1 430 a
# 3 1.0.4 470 470 3 Case1 470 a
# 4 1.0.7 445 445 4 Case1 445 a
# 5 1.0.1 320 320 1 Case2 320 a
# 6 1.0.2 560 NA 2 Case2 560 b
# 7 1.0.4 450 450 3 Case2 450 a
# 8 1.0.7 420 420 4 Case2 420 a
# 9 1.0.1 335 335 1 Case3 335 a
# 10 1.0.2 415 415 2 Case3 415 a
# 11 1.0.4 465 465 3 Case3 465 a
# 12 1.0.7 430 430 4 Case3 430 a
# 13 1.0.1 310 310 1 Case4 310 a
# 14 1.0.2 375 375 2 Case4 375 a
# 15 1.0.4 425 425 3 Case4 425 a
# 16 1.0.7 410 410 4 Case4 410 a
Now we can set the shape based on a variable in the data frame within `aes`:
g <- ggplot(data, aes(color = Case, x = Version, y = TotalMean, group = Case,
                      shape = shapeflag)) + # Set the shape
  geom_line() + geom_point(size = 3) + coord_cartesian(ylim = c(0, 550)) +
  labs(x = "Version", y = "Run Time (minutes)") +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(fun = sum, geom = "line") +
  theme(plot.title = element_text(face = "bold", size = 16, vjust = 1.5)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(axis.title.y = element_text(vjust = 1)) +
  scale_shape_discrete(labels = c("norm", "null"), name = "runs") # Edit the legend
This would be the plot:
>g
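Since the question also asks about size, one possible variation is to map both shape and size to the flag and choose the values by hand. This is only a sketch, assuming a miniature made-up data frame `df` for a single case; the specific shape codes (16, 17) and sizes (2, 4) are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical miniature of the data above: one case, one flagged version
df <- data.frame(
  Version   = factor(c("1.0.1", "1.0.2", "1.0.4", "1.0.7")),
  TotalMean = c(320, 560, 450, 420),
  Case      = "Case2",
  shapeflag = c("a", "b", "a", "a")
)

p <- ggplot(df, aes(x = Version, y = TotalMean, color = Case, group = Case)) +
  geom_line() +
  # Map shape and size on the point layer only, so the line is unaffected
  geom_point(aes(shape = shapeflag, size = shapeflag)) +
  # Filled circle for normal runs, filled triangle for runs with nulls
  scale_shape_manual(values = c(a = 16, b = 17),
                     labels = c("norm", "null"), name = "runs") +
  # Larger points for the flagged runs
  scale_size_manual(values = c(a = 2, b = 4),
                    labels = c("norm", "null"), name = "runs") +
  labs(x = "Version", y = "Run Time (minutes)")

p
```

Giving both scales the same `name` and `labels` lets ggplot2 merge them into a single legend.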
Data:
data <-
structure(list(Version = structure(c(1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1.0.1",
"1.0.2", "1.0.4", "1.0.7"), class = "factor"), First_Run = c(350,
430, 470, 445, 320, 560, 450, 420, 335, 415, 465, 430, 310, 375,
425, 410), Second_Run = c(350, 430, 470, 445, 320, NA, 450, 420,
335, 415, 465, 430, 310, 375, 425, 410), TestNum = c(1, 2, 3,
4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), Case = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("Case1",
"Case2", "Case3", "Case4"), class = "factor")), .Names = c("Version",
"First_Run", "Second_Run", "TestNum", "Case"), row.names = c(NA,
-16L), class = "data.frame")
Upvotes: 1