Dr. Manuel Kuehner
Dr. Manuel Kuehner

Reputation: 433

Extracting/Exporting the Data of the Empirical Cumulative Distribution Function in R (ecdf)

I use R to calculate the ecdf of some data. I want to use the results in another software. I use R just to do the 'work' but not to produce the final diagram for my thesis.

Example Code

# Plotting the a built in sampla data
plot(cars$speed)
# Assingning the data to a new variable name
myData = cars$speed
# Calculating the edcf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)

Output

> # Plotting the a built in sampla data
> plot(cars$speed)
> # Assingning the data to a new variable name
> myData = cars$speed
> # Calculating the edcf
> myResult = ecdf(myData)
> myResult
Empirical CDF 
Call: ecdf(myData)
 x[1:19] =      4,      7,      8,  ...,     24,     25
> # Plotting the ecdf
> plot(myResult)
> plot(cars$speed)

enter image description here

enter image description here

Questions

Question 1

How do I get the raw information in order to plot the ecdf diagram in another software (e. g. Excel, Matlab, LaTeX)? For the histogram function I can just write

res = hist(...)

and I find all the information like

res$breaks
res$counts
res$density
res$mids
res$xname

Question 2

How do I calculate the inverse ecdf? Say I want to know how many cars have a speed below 10 mph (the example data is car speed).

Update

Thanks to the answer of user777 I have more information now. If I use

> myResult(0:25)
 [1] 0.00 0.00 0.00 0.00 0.04 0.04 0.04 0.08 0.10 0.12 0.18 0.22 0.30 0.38
[15] 0.46 0.52 0.56 0.62 0.70 0.76 0.86 0.86 0.88 0.90 0.98 1.00

I get the data for 0 to 25 mph. But I do not know where to draw a data point. I do want to reproduce the R plot exactly.

Here I have a data point every 1 mph.

enter image description here

Here I do not have a data pint every 1 mph. I only have a data point if there is data available.

enter image description here

Solution

# Plotting the a built in sample data
plot(cars$speed)
# Assingning the data to a new variable name
myData = cars$speed
# Calculating the edcf
myResult = ecdf(myData)
myResult
# Plotting the ecdf
plot(myResult)
# Have a look on the probability for 0 to 25 mph
myResult(0:25)
# Have a look on the probability but just where there ara data points
myResult(unique(myData))
# Saving teh stuff to a directory
write.csv(cbind(unique(myData), myResult(unique(myData))), file="D:/myResult.txt")

The file myResult.txt looks like

"","V1","V2"
"1",4,0.04
"2",7,0.08
"3",8,0.1
"4",9,0.12
"5",10,0.18
"6",11,0.22
"7",12,0.3
"8",13,0.38
"9",14,0.46
"10",15,0.52
"11",16,0.56
"12",17,0.62
"13",18,0.7
"14",19,0.76
"15",20,0.86
"16",22,0.88
"17",23,0.9
"18",24,0.98
"19",25,1

Meaning

enter image description here

Attention: I have a German Excel so the decimal symbol is comma instead of the dot.

Upvotes: 1

Views: 7882

Answers (2)

Sycorax
Sycorax

Reputation: 1386

The output of ecdf is a function, among other object types. You can verify this with class(myResult), which displayes the S4 classes of the object myResult.

If you enter myResult(unique(myData)), R evaluates the ecdf object myResult at all distinct values appearing in myData, and prints it to the console. To save the output you can enter write.csv(cbind(unique(myData), myResult(unique(myData))), file="C:/Documents/My ecdf.csv") to save it to that filepath.

The ecdf doesn't tell you how many cars are above/below a specific threshold; rather, it states the probability that a randomly selected car from your data set is above/below the threshold. If you're interested in the number of cars satisfying some criteria, just count them. myData[myData<=10] returns the data elements, and length(myData[myData<=10]) tells you how many of them there are.

Assuming you mean that you want to know the sample probabilities that a randomly-selected car from your data is below 10 mph, that's the value given by myResult(10).

Upvotes: 3

Russ Lenth
Russ Lenth

Reputation: 6770

As I see it, your main requirement is to reproduce the jumps at each x value. Try this:

> x <- c(cars$speed, cars$speed, 1, 28)
> y <- c((0:49)/50, (1:50)/50, 0, 1)
> ord <- order(x)
> plot(y[ord] ~ x[ord], type="l")

Resulting plot

The first 50 (x,y) pairs are tyhe beginnings of the jumps, the next 50 are the ends, and the last two give you starting and ending values at $(x_1-3,0)$ and $(x_{50}+3,1)$. Then you need to sort the values in increasing order in $x$.

Upvotes: 3

Related Questions