Reputation: 2131
I have a Vector
of Vector
s of different length W
. These last vectors contain integers between 0 and 150,000 in steps of 5 but can also be empty. I am trying to compute the empirical cdf for each of those vectors. I could compute these cdf iterating over every vector and every integer like this
cdfdict = Dict{Tuple{Int,Int},Float64}()
for i in 1:length(W)
v = W[i]
len = length(v)
if len == 0
pcdf = 1.0
else
for j in 0:5:150_000
pcdf = length(v[v .<= j])/len
cdfdict[i, j] = pcdf
end
end
end
However, this approach is inefficient because the cdf will be equal to 1 for j >= maximum(v)
and sometimes this maximum(v)
will be much lower than 150,000.
My question is: how can I include a condition that breaks out of the j
loop for j > maximum(v)
but still assigns pcdf = 1.0
for the rest of j
s?
I tried including a break
when j > maximum(v)
but this, of course, stops the loop from continuing for the rest of j
s. Also, I can break the loop and then use get!
to access/include 1.0
for keys not found in cdfdict
later on, but that is not what I'm looking for.
Upvotes: 2
Views: 1709
Reputation: 18217
Another answer gave an implementation using an Array which calculated the CDF by sorting the samples and filling up the CDF bins with quantile values. Since the whole Array is thus filled, doing another pass on the array should not be overly costly (we tolerate a single pass already). The sorting bit and the allocation accompanying it can be avoided by calculating a histogram in the array and using cumsum
to produce a CDF. Perhaps the code will explain this better:
Initialize sizes, lengths and widths:
n = 10; w = 5; rmax = 150_000; hl = length(0:w:rmax)
Produce a sample example:
W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(rmax)*rand(n)))];
Calculate the CDFs:
cdfmat = zeros(Float64,n,hl); # empty histograms
for i=1:n # drop samples into histogram bins
for j=1:length(W[i])
cdfmat[i,1+(W[i][j]+w-1)÷5]+=one(Float64)
end
end
cumsum!(cdfmat,cdfmat,2) # calculate pre-CDF by cumsum
for i=1:n # normalize each CDF by total
if cdfmat[i,hl]==zero(Float64) # check if histogram empty?
for j=1:hl # CDF of 1.0 as default (might be changed)
cdfmat[i,j] = one(Float64)
end
else # the normalization factor calc-ed once
f = one(Float64)/cdfmat[i,hl]
for j=1:hl
cdfmat[i,j] *= f
end
end
end
(a) Note the use of one
,zero
to prepare for change of Real type - this is good practice. (b) Also adding various @inbounds
and @simd
should optimize further. (c) Putting this code in a function is recommended (this is not done in this answer). (d) If having a zero CDF for empty samples is OK (which means no samples means huge samples semantically), then the second for
can be simplified.
See other answers for more options, and reminder: Premature optimization is the root of all evil (Knuth??)
Upvotes: 1
Reputation: 18217
To elaborate on my comment, this answer details an implementation which fills an Array instead of a Dict.
First to create a random test case:
W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(150_000)*rand(10)))]
Next create an array of the right size filled with 1.0s:
cdfmat = ones(Float64,length(W),length(0:5:150_000));
Now to fill the beginning of the CDFs:
for i=1:length(W)
v = sort(W[i])
k = 1
thresh = 0
for j=1:length(v)
if (j>1 && v[j]==v[j-1])
continue
end
pcdf = (j-1)/length(v)
while thresh<v[j]
cdfmat[i,k]=pcdf
k += 1
thresh += 5
end
end
end
This implementation uses a sort
which can be slow sometimes, but the other implementations basically compare the vector with various values which is even slower in most cases.
Upvotes: 2
Reputation: 12051
You can use another for
loop to set all remaining elements to 1.0. The inner loop becomes
m = maximum(v)
for j in 0:5:150_000
if j > m
for k in j:5:150_000
cdfdict[i, k] = 1.0
end
break
end
pcdf = count(x -> x <= j, v)/len
cdfdict[i, j] = pcdf
end
However, this is rather hard to understand. It would be easier to use a branch. In fact, this should be just as fast because the branch is very predictable.
m = maximum(v)
for j in 0:5:150_000
if j > m
cdfdict[i, j] = 1.0
else
pcdf = count(x -> x <= j, v)/len
cdfdict[i, j] = pcdf
end
end
Upvotes: 2
Reputation: 19132
break
only does one level. You can do what you want by wrapping the for loop function and using return
(instead of where you would've put break), or using @goto
.
Or where you would break, you could switch a boolean breakd=true
and then break, and at the bottom of the larger loop do if breakd break end
.
Upvotes: 2