kmsquire
kmsquire

Reputation: 1452

How can I do a bitwise-or reduction along an axis of a boolean array in Julia?

I'm trying to find the best way to do a bitwise-or reduction of a 3D boolean array of masks to 2D in Julia.

I can always write a for loop, of course:

x = randbool(3,3,3)
out = copy(x[:,:,1])
for i = 1:3
    for j = 1:3
        for k = 2:3
            out[i,j] |= x[i,j,k]
        end
    end
end

But I'm wondering if there is a better way to do the reduction.

Upvotes: 3

Views: 584

Answers (4)

simonster
simonster

Reputation: 556

It is faster to devectorize. It's just a matter of how much work you want to put in. The naïve devectorized approach is slow because it's a BitArray: extracting contiguous regions and bitwise OR can both be done a 64-bit chunk at a time, but the naïve devectorized approach operates an element at a time. On top of that, indexing BitArrays is slow, both because there is a sequence of bit operations involved and because it can't presently be inlined due to the bounds check. Here's a strategy that is devectorized but exploits the structure of the BitArray. Most of the code is copy-pasted from copy_chunks! in bitarray.jl and I didn't try to prettify it (sorry!).

function devec(n::Int, x::BitArray)
    src = x.chunks
    out = falses(n, n)
    dest = out.chunks
    numbits = n*n

    kd0 = 1
    ld0 = 0
    for j = 1:n
        pos_s = (n*n)*(j-1)+1
        kd1, ld1 = Base.get_chunks_id(numbits - 1)
        ks0, ls0 = Base.get_chunks_id(pos_s)
        ks1, ls1 = Base.get_chunks_id(pos_s + numbits - 1)

        delta_kd = kd1 - kd0
        delta_ks = ks1 - ks0

        u = Base._msk64
        if delta_kd == 0
            msk_d0 = ~(u << ld0) | (u << (ld1+1))
        else
            msk_d0 = ~(u << ld0)
            msk_d1 = (u << (ld1+1))
        end
        if delta_ks == 0
            msk_s0 = (u << ls0) & ~(u << (ls1+1))
        else
            msk_s0 = (u << ls0)
        end

        chunk_s0 = Base.glue_src_bitchunks(src, ks0, ks1, msk_s0, ls0)

        dest[kd0] |= (dest[kd0] & msk_d0) | ((chunk_s0 << ld0) & ~msk_d0)

        delta_kd == 0 && continue

        for i = 1 : kd1 - kd0
            chunk_s1 = Base.glue_src_bitchunks(src, ks0 + i, ks1, msk_s0, ls0)

            chunk_s = (chunk_s0 >>> (64 - ld0)) | (chunk_s1 << ld0)

            dest[kd0 + i] |= chunk_s

            chunk_s0 = chunk_s1
        end
    end

    out
end

With Iain's benchmarks, this gives me:

simple: 0.051321131 seconds (46356000 bytes allocated, 30.03% gc time)
forloops: 6.226652258 seconds (92976 bytes allocated)
forloopscolfirst: 2.099381939 seconds (89472 bytes allocated)
shorty: 0.060194226 seconds (46387760 bytes allocated, 36.27% gc time)
timholy: 2.464298752 seconds (31784 bytes allocated)
devec: 0.008734413 seconds (31472 bytes allocated)

Upvotes: 1

IainDunning
IainDunning

Reputation: 11664

A simple answer would be

out = x[:,:,1] | x[:,:,2] | x[:,:,3]

but I did some benchmarking:

function simple(n,x)
    out = x[:,:,1] | x[:,:,2]
    for k = 3:n
        @inbounds out |= x[:,:,k]
    end
    return out
end

function forloops(n,x)
    out = copy(x[:,:,1])
    for i = 1:n
        for j = 1:n
            for k = 2:n
                @inbounds out[i,j] |= x[i,j,k]
            end
        end
    end
    return out
end

function forloopscolfirst(n,x)
    out = copy(x[:,:,1])
    for j = 1:n
        for i = 1:n
            for k = 2:n
                @inbounds out[i,j] |= x[i,j,k]
            end
        end
    end
    return out
end

shorty(n,x) = |([x[:,:,i] for i in 1:n]...)

timholy(n,x) = any(x,3)

function runtest(n)
    x = randbool(n,n,n)

    @time out1 = simple(n,x)
    @time out2 = forloops(n,x)
    @time out3 = forloopscolfirst(n,x)
    @time out4 = shorty(n,x)
    @time out5 = timholy(n,x)

    println(all(out1 .== out2))
    println(all(out1 .== out3))
    println(all(out1 .== out4))
    println(all(out1 .== out5))
end

runtest(3)
runtest(500)

which gave the following results

# For 500
simple: 0.039403016 seconds (39716840 bytes allocated)
forloops: 6.259421683 seconds (77504 bytes allocated)
forloopscolfirst 1.809124505 seconds (77504 bytes allocated)
shorty: elapsed time: 0.050384062 seconds (39464608 bytes allocated)
timholy: 2.396887396 seconds (31784 bytes allocated)

So I'd go with simple or shorty

Upvotes: 2

waTeim
waTeim

Reputation: 9244

There are various standard optimization tricks and hints that can be applied, but the critical observation to make here is that Julia organizes array in column-major rather than row-major order. For small size arrays this is not easily seen but when the arrays grow large it's telling. There is a method reduce provided that is optimized to perform an function on a collection (in this case OR), but it comes at a cost. If the number of combining steps is relatively small then it's better to simply loop. In all cases minimizing the number of memory access is over all better. Below are various attempts at optimization using these 2 things in mind.

Various Attempts and Observations

Initial function

Here's a function that takes your example and generalizes it.

function boolReduce1(x)
   out = copy(x[:,:,1])
   for i = 1:size(x,1)
       for j = 1:size(x,2)
           for k = 2:size(x,3)
               out[i,j] |= x[i,j,k]
           end
       end
   end
   out
end

Creating a fairly large array, we can time it's performance

julia> @time boolReduce1(b); 
elapsed time: 42.372058096 seconds (1056704 bytes allocated)

Applying optimizations

Here's another similar version but with the standard type hints, use of @inbounds and inverting the loops.

function boolReduce2(b::BitArray{3})
   a = BitArray{2}(size(b)[1:2]...)
   for j = 1:size(b,2)
      for i = 1:size(b,1)
         @inbounds a[i,j] = b[i,j,1]
         for k = 2:size(b,3)
            @inbounds a[i,j] |= b[i,j,k]
         end
      end
   end
   a
end

And take the time

julia> @time boolReduce2(b);
elapsed time: 12.892392891 seconds (500520 bytes allocated)

The insight

The 2nd function is a lot faster, and also less memory is allocated because a temporary array wasn't created. But what if we simply take the first function and invert the array indexing?

function boolReduce3(x)
   out = copy(x[:,:,1])
   for j = 1:size(x,2)
       for i = 1:size(x,1)
           for k = 2:size(x,3)
               out[i,j] |= x[i,j,k]
           end
       end
   end
   out
end

and take the time now

julia> @time boolReduce3(b);
elapsed time: 12.451501749 seconds (1056704 bytes allocated)

That's just as fast as the 2nd function.

Using reduce

There is a function called reduce that we can use to eliminate the 3rd loop. Its function is to repeatedly apply an operation on all of the elements with the result of the previous operation. This is exactly what we want.

function boolReduce4(b)
   a = BitArray{2}(size(b)[1:2]...)
   for j = 1:size(b,2)
      for i = 1:size(b,1)
         @inbounds a[i,j] = reduce(|,b[i,j,:])
     end
   end
   a
end

Now take it's time

julia> @time boolReduce4(b);
elapsed time: 15.828273008 seconds (1503092520 bytes allocated, 4.07% gc time)

That's ok, but not even as fast as the simple optimized original. The reason is, take a look at all of the extra memory that was allocated. This is because data has to be copied from all over to produce input for reduce.

Combining things

But what if we max out the insight as best we can. Instead of the last index being reduced, the first one is?

function boolReduceX(b)
   a = BitArray{2}(size(b)[2:3]...)
   for j = 1:size(b,3)
      for i = 1:size(b,2)
         @inbounds a[i,j] = reduce(|,b[:,i,j])
      end
   end
   a
end

And now create a similar array and time it.

julia> c = randbool(200,2000,2000);

julia> @time boolReduceX(c);
elapsed time: 1.877547669 seconds (927092520 bytes allocated, 21.66% gc time)

Resulting in a function 20x faster than the original version for large arrays. Pretty good.

But what if medium size?

If the size is very large then the above function appears best, but if the data set size is smaller, the use of reduce doesn't pay enough back and the following is faster. Including a temporary variable speeds things from version 2. Another version of boolReduceX using a loop instead of reduce (not show here) was even faster.

function boolReduce5(b)
   a = BitArray{2}(size(b)[1:2]...)
   for j = 1:size(b,2)
      for i = 1:size(b,1)
         @inbounds t = b[i,j,1]
         for k = 2:size(b,3)
            @inbounds t |= b[i,j,k]
         end
         @inbounds a[i,j] = t
      end
   end
   a
end

julia> b = randbool(2000,2000,20);
julia> c = randbool(20,2000,2000);

julia> @time  boolReduceX(c);
elapsed time: 1.535334322 seconds (799092520 bytes allocated, 23.79% gc time)

julia> @time  boolReduce5(b);
elapsed time: 0.491410981 seconds (500520 bytes allocated)

Upvotes: 2

tholy
tholy

Reputation: 12179

Try any(x, 3). Just typing a little more here so StackOverflow doesn't nix this response.

Upvotes: 2

Related Questions