amrods
amrods

Reputation: 2131

breaking out of a loop in Julia

I have a Vector of Vectors of different length W. These last vectors contain integers between 0 and 150,000 in steps of 5 but can also be empty. I am trying to compute the empirical cdf for each of those vectors. I could compute these cdf iterating over every vector and every integer like this

cdfdict = Dict{Tuple{Int,Int},Float64}()
for i in 1:length(W)
    v = W[i]
    len = length(v)
    if len == 0
        pcdf = 1.0
    else
        for j in 0:5:150_000
            pcdf = length(v[v .<= j])/len
            cdfdict[i, j] = pcdf
        end
    end
end

However, this approach is inefficient because the cdf will be equal to 1 for j >= maximum(v) and sometimes this maximum(v) will be much lower than 150,000.

My question is: how can I include a condition that breaks out of the j loop for j > maximum(v) but still assigns pcdf = 1.0 for the rest of js?

I tried including a break when j > maximum(v) but this, of course, stops the loop from continuing for the rest of js. Also, I can break the loop and then use get! to access/include 1.0 for keys not found in cdfdict later on, but that is not what I'm looking for.

Upvotes: 2

Views: 1709

Answers (4)

Dan Getz
Dan Getz

Reputation: 18217

Another answer gave an implementation using an Array which calculated the CDF by sorting the samples and filling up the CDF bins with quantile values. Since the whole Array is thus filled, doing another pass on the array should not be overly costly (we tolerate a single pass already). The sorting bit and the allocation accompanying it can be avoided by calculating a histogram in the array and using cumsum to produce a CDF. Perhaps the code will explain this better:

Initialize sizes, lengths and widths:

n = 10; w = 5; rmax = 150_000; hl = length(0:w:rmax)

Produce a sample example:

W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(rmax)*rand(n)))];

Calculate the CDFs:

cdfmat = zeros(Float64,n,hl);  # empty histograms
for i=1:n                      # drop samples into histogram bins
  for j=1:length(W[i])
    cdfmat[i,1+(W[i][j]+w-1)÷5]+=one(Float64)
  end
end
cumsum!(cdfmat,cdfmat,2)       # calculate pre-CDF by cumsum
for i=1:n                      # normalize each CDF by total 
  if cdfmat[i,hl]==zero(Float64) # check if histogram empty?
    for j=1:hl                 # CDF of 1.0 as default (might be changed)
      cdfmat[i,j] = one(Float64)
    end
  else                         # the normalization factor calc-ed once
    f = one(Float64)/cdfmat[i,hl]
    for j=1:hl
      cdfmat[i,j] *= f
    end
  end
end

(a) Note the use of one,zero to prepare for change of Real type - this is good practice. (b) Also adding various @inbounds and @simd should optimize further. (c) Putting this code in a function is recommended (this is not done in this answer). (d) If having a zero CDF for empty samples is OK (which means no samples means huge samples semantically), then the second for can be simplified.

See other answers for more options, and reminder: Premature optimization is the root of all evil (Knuth??)

Upvotes: 1

Dan Getz
Dan Getz

Reputation: 18217

To elaborate on my comment, this answer details an implementation which fills an Array instead of a Dict.

First to create a random test case:

W = [rand(0:mv,rand(0:10)) for mv in floor(Int,exp(log(150_000)*rand(10)))]

Next create an array of the right size filled with 1.0s:

cdfmat = ones(Float64,length(W),length(0:5:150_000));

Now to fill the beginning of the CDFs:

for i=1:length(W)
    v = sort(W[i])
    k = 1
    thresh = 0
    for j=1:length(v)
        if (j>1 && v[j]==v[j-1])
            continue
        end
        pcdf = (j-1)/length(v)
        while thresh<v[j]
            cdfmat[i,k]=pcdf
            k += 1
            thresh += 5
        end
    end
end

This implementation uses a sort which can be slow sometimes, but the other implementations basically compare the vector with various values which is even slower in most cases.

Upvotes: 2

Fengyang Wang
Fengyang Wang

Reputation: 12051

You can use another for loop to set all remaining elements to 1.0. The inner loop becomes

m = maximum(v)
for j in 0:5:150_000
    if j > m
        for k in j:5:150_000
            cdfdict[i, k] = 1.0
        end
        break
    end
    pcdf = count(x -> x <= j, v)/len
    cdfdict[i, j] = pcdf
end

However, this is rather hard to understand. It would be easier to use a branch. In fact, this should be just as fast because the branch is very predictable.

m = maximum(v)
for j in 0:5:150_000
    if j > m
        cdfdict[i, j] = 1.0
    else
        pcdf = count(x -> x <= j, v)/len
        cdfdict[i, j] = pcdf
    end
end

Upvotes: 2

Chris Rackauckas
Chris Rackauckas

Reputation: 19132

break only does one level. You can do what you want by wrapping the for loop function and using return (instead of where you would've put break), or using @goto.

Or where you would break, you could switch a boolean breakd=true and then break, and at the bottom of the larger loop do if breakd break end.

Upvotes: 2

Related Questions