Neal

Reputation: 199

Making ggplot2's "geom_point" variable depending on certain conditions

I have an R script that generates plots based on the run time data from a simulation. However, sometimes there are errors during the runs which result in null run time values and lead to graphics that make it seem like the run time is smaller than it really was.

Here's an example of what the data in the "data" data frame might look like:

| Version | TotalMean | TestNum |  Case |
|:-------:|:---------:|:-------:|:-----:|
| 1.0.1   |       350 |       1 | Case1 |
| 1.0.2   |       430 |       2 | Case1 |
| 1.0.4   |       470 |       3 | Case1 |
| 1.0.7   |       445 |       4 | Case1 |
| 1.0.1   |       320 |       1 | Case2 |
| 1.0.2   |       280 |       2 | Case2 |
| 1.0.4   |       450 |       3 | Case2 |
| 1.0.7   |       420 |       4 | Case2 |
| 1.0.1   |       335 |       1 | Case3 |
| 1.0.2   |       415 |       2 | Case3 |
| 1.0.4   |       465 |       3 | Case3 |
| 1.0.7   |       430 |       4 | Case3 |
| 1.0.1   |       310 |       1 | Case4 |
| 1.0.2   |       375 |       2 | Case4 |
| 1.0.4   |       425 |       3 | Case4 |
| 1.0.7   |       410 |       4 | Case4 |

Note that no null values appear in that table, because the way the TotalMean column is calculated never produces one. However, there are nulls in the data frame that TotalMean is calculated from. Is there a way to make geom_point depend on whether that source table contains null values, for example by changing the shape and size of the affected point?

Use the code below to create a working example. Version 1.0.2 in Case2 has an anomalous value because it had null values in the original table.

library(ggplot2)

Version <- c("1.0.1","1.0.2","1.0.4","1.0.7","1.0.1","1.0.2","1.0.4","1.0.7","1.0.1","1.0.2","1.0.4","1.0.7","1.0.1","1.0.2","1.0.4","1.0.7")
TotalMean <- c(350,430,470,445,320,280,450,420,335,415,465,430,310,375,425,410)
TestNum <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
Case <- c("Case1","Case1","Case1","Case1","Case2","Case2","Case2","Case2","Case3","Case3","Case3","Case3","Case4","Case4","Case4","Case4")
data <- data.frame(Version,TotalMean,TestNum,Case)
versions <- unique(data[order(data$TestNum), ][,1])
data$Version <- factor(data$Version, levels = versions)

Here's the code that I use to create the chart (using ggplot2):

g<-ggplot(data, aes(color = Case, x = Version, y = TotalMean, group = Case)) + 
    geom_line() + geom_point(shape = 16, size = 2) + coord_cartesian(ylim=c(0,550)) + 
    labs(x="Version", y="Run Time (minutes)") + 
    stat_summary(fun = sum, geom = "line") +  # fun replaces fun.y, deprecated since ggplot2 3.3.0
    theme(plot.title = element_text(face = "bold", size = 16, vjust = 1.5)) + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + 
    theme(axis.title.y = element_text(vjust = 1))
g

Upvotes: 1

Views: 1692

Answers (1)

M--

Reputation: 28825

I made a data frame (structure provided at the bottom) that looks like this:

#    Version First_Run Second_Run TestNum  Case 
# 1    1.0.1       350        350       1 Case1 
# 2    1.0.2       430        430       2 Case1 
# 3    1.0.4       470        470       3 Case1 
# 4    1.0.7       445        445       4 Case1 
# 5    1.0.1       320        320       1 Case2 
# 6    1.0.2       560         NA       2 Case2 
# 7    1.0.4       450        450       3 Case2 
# 8    1.0.7       420        420       4 Case2 
# 9    1.0.1       335        335       1 Case3 
# 10   1.0.2       415        415       2 Case3 
# 11   1.0.4       465        465       3 Case3 
# 12   1.0.7       430        430       4 Case3 
# 13   1.0.1       310        310       1 Case4 
# 14   1.0.2       375        375       2 Case4 
# 15   1.0.4       425        425       3 Case4 
# 16   1.0.7       410        410       4 Case4

Then I calculated the mean and a column for shape:

data$TotalMean <- rowMeans(subset(data, select = c(First_Run, Second_Run)), na.rm = TRUE)

data$shapeflag <- ifelse(is.na(data$First_Run * data$Second_Run), "b", "a")

Note: na.rm = TRUE omits NA values when calculating the mean, so you can use it in your own calculations to adjust the mean while still having the shapeflag column to identify the runs that returned NULL. You can see that it returned 560 for the sixth row instead of 280.
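
To make the effect of na.rm concrete, here is a minimal base-R sketch using a normal row and the sixth row's values (560 and NA). It also shows an alternative way to build the flag with rowSums(is.na(...)), which should extend to more than two run columns. The names `runs`, `mean_default`, `mean_na_rm` and `flag` are made up for this illustration and are not part of the code above.

```r
# Two rows: a normal one, and one with a valid run plus a failed (NA) run
runs <- data.frame(First_Run = c(350, 560), Second_Run = c(350, NA))

# Without na.rm the NA propagates into the mean of the second row
mean_default <- rowMeans(runs)

# With na.rm = TRUE the NA is dropped, so the mean uses the valid run only
mean_na_rm <- rowMeans(runs, na.rm = TRUE)

# Alternative flag: "b" if any run column in the row is NA, otherwise "a"
flag <- ifelse(rowSums(is.na(runs)) > 0, "b", "a")

print(mean_default)
print(mean_na_rm)
print(flag)
```

The rowSums(is.na(...)) form avoids relying on NA propagating through a multiplication, which may be easier to read when there are many run columns.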

This is how the final dataset looks:

#    Version First_Run Second_Run TestNum  Case TotalMean shapeflag 
# 1    1.0.1       350        350       1 Case1       350         a 
# 2    1.0.2       430        430       2 Case1       430         a 
# 3    1.0.4       470        470       3 Case1       470         a 
# 4    1.0.7       445        445       4 Case1       445         a 
# 5    1.0.1       320        320       1 Case2       320         a 
# 6    1.0.2       560         NA       2 Case2       560         b 
# 7    1.0.4       450        450       3 Case2       450         a 
# 8    1.0.7       420        420       4 Case2       420         a 
# 9    1.0.1       335        335       1 Case3       335         a 
# 10   1.0.2       415        415       2 Case3       415         a 
# 11   1.0.4       465        465       3 Case3       465         a 
# 12   1.0.7       430        430       4 Case3       430         a 
# 13   1.0.1       310        310       1 Case4       310         a 
# 14   1.0.2       375        375       2 Case4       375         a 
# 15   1.0.4       425        425       3 Case4       425         a 
# 16   1.0.7       410        410       4 Case4       410         a

Now we can set the shape based on a variable in the data frame within aes:

g<-ggplot(data, aes(color = Case, x = Version, y = TotalMean, group = Case,
                    shape = shapeflag)) + #Set the shape
  geom_line() + geom_point(size = 3) + coord_cartesian(ylim=c(0,550)) + 
  labs(x="Version", y="Run Time (minutes)") + 
  stat_summary(fun = sum, geom = "line") +  # fun replaces fun.y, deprecated since ggplot2 3.3.0
  theme(plot.title = element_text(face = "bold", size = 16, vjust = 1.5)) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + 
  theme(axis.title.y = element_text(vjust = 1)) +
  scale_shape_discrete(labels=c("norm","null"),name="runs") #Edit the legend

This would be the plot:

> g

https://i.sstatic.net/Y4Lce.png
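
Since the question also asks about size, here is a hedged sketch of one way to control both shape and size explicitly with scale_shape_manual and scale_size_manual. It uses a small made-up subset of the Case2 rows (`df`) rather than the full data, and the specific shape codes (16, 17) and sizes (2, 4) are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up subset of the Case2 rows; "b" marks the row with a failed (NA) run
df <- data.frame(
  Version   = factor(c("1.0.1", "1.0.2", "1.0.4", "1.0.7")),
  TotalMean = c(320, 560, 450, 420),
  Case      = "Case2",
  shapeflag = c("a", "b", "a", "a")
)

p <- ggplot(df, aes(x = Version, y = TotalMean, color = Case, group = Case)) +
  geom_line() +
  # Map both shape and size to the flag, for the points layer only
  geom_point(aes(shape = shapeflag, size = shapeflag)) +
  # Filled circle (16) for normal runs, filled triangle (17) for runs with NA
  scale_shape_manual(values = c(a = 16, b = 17),
                     labels = c("norm", "null"), name = "runs") +
  # Same name and labels as the shape scale so the two legends can merge
  scale_size_manual(values = c(a = 2, b = 4),
                    labels = c("norm", "null"), name = "runs") +
  labs(x = "Version", y = "Run Time (minutes)")
p
```

Putting the shape and size mappings inside geom_point keeps them local to the points, so the line layer is unaffected by the flag.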

Data:

data <- 
       structure(list(Version = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 
       3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1.0.1", 
       "1.0.2", "1.0.4", "1.0.7"), class = "factor"), First_Run = c(350, 
       430, 470, 445, 320, 560, 450, 420, 335, 415, 465, 430, 310, 375, 
       425, 410), Second_Run = c(350, 430, 470, 445, 320, NA, 450, 420, 
       335, 415, 465, 430, 310, 375, 425, 410), TestNum = c(1, 2, 3, 
       4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), Case = structure(c(1L, 
       1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("Case1", 
       "Case2", "Case3", "Case4"), class = "factor")), .Names = c("Version", 
       "First_Run", "Second_Run", "TestNum", "Case"), row.names = c(NA, 
       -16L), class = "data.frame")

Upvotes: 1
