Legend
Legend

Reputation: 116780

Why do values not appear in ecdf plot?

I am trying to plot the ccdf of the data given below but for some reason, it doesn't look right. I was cross checking with some data points (2523, 313, 224) but they are not visible. Am I doing something wrong?

R Script:

# Y defined below
Y.ecdf = ecdf(Y)
curve((length((Y))*(1-Y.ecdf(x))), n = 10000, 
       from = 0, to = 100, xlab = "# of items", 
       ylab = "# instances", col=colors[1], lty=1, lwd=4)

ecdf plot

Y = c( 3, 1, 4, 11, 2, 2, 9, 7, 22, 3, 1, 1, 7, 2, 2, 2, 4, 2, 1, 1, 6, 3, 20,
15, 4, 1, 1, 5, 3, 10, 16, 224, 74, 2, 1, 2, 2, 3, 3, 7, 2, 2, 1, 4, 2, 9,
3, 3, 2, 1, 1, 3, 2, 4, 4, 1, 7, 2, 1, 2, 1, 1, 2, 4, 3, 1, 1, 1, 3, 4, 2,
2, 1, 1, 5, 6, 13, 15, 3, 1, 2, 5, 1, 1, 1, 1, 2, 6, 1, 4, 1, 3, 1, 1, 4,
2, 2, 3, 3, 1, 4, 2, 1, 4, 6, 1, 1, 1, 1, 2, 5, 2, 1, 1, 1, 1, 1, 3, 1, 3,
2, 1, 1, 1, 2, 1, 8, 2, 3, 1, 1, 1, 1, 1, 3, 1, 3, 2, 1, 2, 1, 1, 5, 1, 1,
4, 3, 3, 1, 1, 1, 3, 4, 4, 3, 2, 2, 4, 3, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3,
2, 3, 9, 3, 4, 2, 1, 1, 1, 3, 22, 5, 13, 1, 1, 1, 1, 1, 4, 1, 1, 31, 1, 1,
2, 1, 1, 1, 3, 4, 4, 8, 6, 6, 7, 2, 1, 2, 2, 5, 1, 2, 6, 6, 1, 3, 1, 5, 2,
1, 5, 3, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 4, 1, 3, 2, 1, 4, 1, 212, 2,
7, 7, 10, 2, 4, 2, 1, 1, 1, 2, 3, 2, 1, 16, 6, 2, 10, 2, 1, 1, 15, 1, 3, 8,
1, 1, 3, 1, 1, 2, 1, 1, 4, 2, 3, 1, 1, 1, 1, 5, 9, 4, 1, 1, 2, 5, 1, 4, 9,
6, 19, 1, 1, 1, 2, 10, 6, 9, 5, 11, 6, 8, 1, 1, 1, 1, 1, 313, 3, 1, 3, 1,
2, 2, 2, 3, 4, 5, 1, 1, 3, 1, 1, 5, 4, 2, 5, 1, 20, 4, 1, 2, 1, 1, 1, 2, 5,
4, 2, 3, 1, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 3, 3, 1, 1, 1, 8, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 4, 13, 1, 2, 1, 2, 3, 3, 1, 2, 2, 1, 3, 4, 1, 1, 1, 1, 2,
2, 4, 5, 3, 2, 2, 2, 1, 1, 3, 2523, 7, 4, 2, 4, 11, 8, 1, 4, 4, 2, 5, 3, 3,
1, 3, 1, 3, 4, 1, 1, 1, 1, 6, 6, 2, 2, 1, 8, 8, 3, 3, 4, 5, 2, 2, 2, 3, 2,
6, 2, 2, 2, 1, 5, 5, 4, 3, 1, 2, 2, 6, 3, 2, 2, 2, 10, 9, 1, 2, 1, 1, 1, 2,
2, 3, 1, 3, 1, 9, 1, 1, 1, 2, 1, 96, 2, 2, 5, 1, 1, 1, 2, 2, 1, 1, 1, 5, 2,
1, 1, 1, 2, 1, 1, 4, 2, 10, 3, 2, 2, 8, 8, 2, 1, 2, 4, 1, 1, 13, 20, 3, 2,
5, 9, 1, 22, 25, 4, 1, 1, 3, 2, 1, 1, 7, 9, 5, 9, 1, 3, 1, 8, 2, 2, 1, 3,
1, 2, 6, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 16, 3, 5, 2)

Upvotes: 2

Views: 524

Answers (1)

joran
joran

Reputation: 173517

Expanding on our discussion in the comments...

An empirical cumulative distribution function is a plot of X (x axis) vs. Pr(X < x) (y axis). So for your example it would look something like this:

plot(Y.ecdf,do.points = FALSE,
     verticals = TRUE,col = "blue",
     xlab = "x", ylab = "Pr(X < x)")

enter image description here

If you look very closely you can see where the line goes up when you reach your very large values, but it's hard to make out since so many of your values are less than 10.

What you've done is to invert this function so that you're looking at the opposite tail of the distribution, i.e. Pr(X > x). You've also scaled the probabilities on the y axis. I'm not sure why, but whatever. It might make sense given your particular task. So you're doing something like this (but with the y axis scaling):

curve((1-Y.ecdf(x)), n = 10000, 
       from = 0, to = 2600, ylab = "Pr(X > x)", 
       xlab = "x", col="blue", lty=1, lwd=2)

enter image description here

but you originally had the from and to arguments set to only plot the function from 0 to 100. If you wanted to "zoom in" on your outliers, you could just change the from and to values to something more relevant:

curve((1-Y.ecdf(x)), n = 10000, 
       from = 250, to = 2600, ylab = "Pr(X > x)", 
       xlab = "x", col="blue", lty=1, lwd=2)

enter image description here

Upvotes: 2

Related Questions