Reputation: 149
I am trying to plot CDFs for multiple data sets on one plot, restricted to a subset of the x range. I restrict the range because the values can be very large and I do not want a wide x-axis. Regardless of the range I choose, the plotted CDF always runs from 0 to 1 on the y-axis. Since data exist outside the displayed range, the CDF should never reach 1 within it, but should stop at a somewhat smaller ratio. How can I compute the CDFs over the whole distribution and then show only a subset of them in the plot?
This code plots CDFs, but they do not reflect the data that lie beyond the x-axis range. At or around x = 50 the curves reach y = 1, which is impossible. I have tried a few other stat_ecdf options (commented out with # below) with no success.
library(moments)
library(ggplot2)
library(plyr)
library(dplyr)
library(reshape2)
library(RColorBrewer)
library(cowplot)
library(scales)
library(gridExtra)
require(data.table)
require(grid)
set.seed(8)
dat1 <- data.frame(a = replicate(1,sample(25:300,1000,rep=TRUE)))
dat2 <- data.frame(b = replicate(1,sample(25:350,950,rep=TRUE)))
dat3 <- data.frame(c = replicate(1,sample(25:400,965,rep=TRUE)))
dat4 <- data.frame(d = replicate(1,sample(25:450,970,rep=TRUE)))
d1_bind = bind_rows(dat1,dat2,dat3,dat4)
md1 <- melt(d1_bind)
colnames(md1) <- c("Dat","Value")
summary(md1)
ggplot(md1, aes(x = Value, color = Dat, linetype = Dat)) +
  stat_ecdf(aes(color = Dat),
            # pad = TRUE, # this does not plot correctly
            # n = 38850, # this or set to NULL does not plot correctly
            geom = "line", size = 1) +
  scale_linetype_manual(values = c("solid", "solid", "solid", "solid")) +
  scale_y_continuous(limits = c(0, 1.0), breaks = seq(0, 1.0, by = 0.05)) +
  scale_x_continuous(limits = c(25, 50)) +
  # scale_x_discrete(breaks = 26:451) + # this does not plot correctly
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73"))
quit()
Using stat_bin and manually computing the cumulative sum results in the same plot as the stat_ecdf call above.
ggplot(md1, aes(x = Value, color = Dat, linetype = Dat)) +
  stat_bin(aes(y = cumsum(..count..)/sum(..count..)),
           geom = "line", size = 1) +
  scale_linetype_manual(values = c("solid", "solid", "solid", "solid")) +
  scale_y_continuous(limits = c(0, 1.0), breaks = seq(0, 1.0, by = 0.05)) +
  scale_x_continuous(limits = c(25, 50)) +
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73"))
Upvotes: 0
Views: 306
Reputation: 18612
Your code and the following code give the same plot, which is what I would expect:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(8)
dat1 <- data.frame(a = replicate(1,sample(25:300,1000,rep=TRUE)))
dat2 <- data.frame(b = replicate(1,sample(25:350,950,rep=TRUE)))
dat3 <- data.frame(c = replicate(1,sample(25:400,965,rep=TRUE)))
dat4 <- data.frame(d = replicate(1,sample(25:450,970,rep=TRUE)))
df <- bind_rows(dat1, dat2, dat3, dat4, .id = "dat")
# Reshape to long format, dropping the NA cells that bind_rows() fills in
df1 <- df %>%
  pivot_longer(cols = a:d, values_drop_na = TRUE)
ggplot(df1, aes(x = value, color = dat, linetype = dat)) +
  stat_ecdf(aes(color = dat))
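If you want to set the limits without recalculating the ECDF (i.e. "zoom" in on the graph), use coord_cartesian, not scale_x_continuous. Limits set on a scale drop the out-of-range values before the statistic is computed, so the ECDF is rebuilt from the remaining data alone, whereas coord_cartesian only changes the visible region: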
ggplot(df1, aes(x = value, color = dat, linetype = dat)) +
  stat_ecdf() +
  coord_cartesian(xlim = c(25, 50),
                  ylim = c(0, 0.1))
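If you would rather compute the CDF over the whole distribution yourself and then plot only part of it, as the question asks, one possible approach is to build each group's ECDF with base R's ecdf() and evaluate it only on the x range you want to show. This is just a sketch under assumptions: the names df_long, x_grid, ecdf_df and cum_prob are made up for illustration, and the data are regenerated in long format rather than reusing df1.

```r
library(ggplot2)

set.seed(8)
# Toy data like the samples above, in long format (one row per observation).
# df_long is a hypothetical name, not an object from the code above.
df_long <- rbind(
  data.frame(dat = "a", value = sample(25:300, 1000, replace = TRUE)),
  data.frame(dat = "b", value = sample(25:350, 950, replace = TRUE)),
  data.frame(dat = "c", value = sample(25:400, 965, replace = TRUE)),
  data.frame(dat = "d", value = sample(25:450, 970, replace = TRUE))
)

# Build each group's ECDF from ALL of its observations, then evaluate it
# only on the x values that should be displayed.
x_grid <- 25:50
ecdf_df <- do.call(rbind, lapply(split(df_long, df_long$dat), function(g) {
  f <- ecdf(g$value)  # stats::ecdf() returns a step function
  data.frame(dat = g$dat[1], value = x_grid, cum_prob = f(x_grid))
}))

# The curves stop well below 1 because most observations lie beyond x = 50.
p <- ggplot(ecdf_df, aes(x = value, y = cum_prob, color = dat)) +
  geom_step()
print(p)
```

Because the proportions are computed before any subsetting, no axis limits are needed here; the plot simply covers the range of x_grid.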
Upvotes: 1