Reputation: 11209
This question (Parallelize code which is doing bit wise operation) got a very good answer and extremely efficient code that I could only match with C code. That prompted me to attempt to at least match the Python code in Julia.
The following is the Python code that takes 0.64 seconds vs C code 0.27 seconds to pack bits in unsigned integers.
import numpy as np
import numba as nb
import time
colm = int(200000/8)
rows = 10000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)
@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
for i in nb.prange(A.shape[0]):
for j in range(A.shape[1]):
offset = j * 8
res = AU[i,offset] << 7
res |= AU[i,offset+1] << 6
res |= AU[i,offset+2] << 5
res |= AU[i,offset+3] << 4
res |= AU[i,offset+4] << 3
res |= AU[i,offset+5] << 2
res |= AU[i,offset+6] << 1
res |= AU[i,offset+7]
A[i,j] = res
start_time = time.time()
compute(A, AU)
end_time = time.time()
print(end_time - start_time)
The following are mare various failed attempts to match that performance in Julia:
using Random
colm = 200000÷8
rows = 10000
cols = colm*8
AU = zeros(UInt8, (rows, cols))
rand!(AU)
AU .&= 0x01
A = BitArray(undef, rows, cols)
B = zeros(UInt8, (rows, colm))
function compute1(A, AU)
A[:,:] .= AU .== 1
end
function compute2(A, AU)
for i in 1:size(A)[2]
start_col = (i-1) << 3
A[:, i] .= AU[:, start_col + 1] .|
(AU[:, start_col + 2] .<< 1) .|
(AU[:, start_col + 3] .<< 2) .|
(AU[:, start_col + 4] .<< 3) .|
(AU[:, start_col + 5] .<< 4) .|
(AU[:, start_col + 6] .<< 5) .|
(AU[:, start_col + 7] .<< 6) .|
(AU[:, start_col + 8] .<< 7)
end
end
function compute3(A, AU)
for i in 1:size(A)[2]
start_col = (i-1) << 3
A[:, i] .|= AU[:, start_col + 1]
A[:, i] .|= AU[:, start_col + 2] .<< 1
A[:, i] .|= AU[:, start_col + 3] .<< 2
A[:, i] .|= AU[:, start_col + 4] .<< 3
A[:, i] .|= AU[:, start_col + 5] .<< 4
A[:, i] .|= AU[:, start_col + 6] .<< 5
A[:, i] .|= AU[:, start_col + 7] .<< 6
A[:, i] .|= AU[:, start_col + 8] .<< 7
end
end
function compute4(A, AU)
for i in 0:7
au_columns = [((j-1) << 3) + i + 1 for j in 1:size(A)[2]]
A[:, :] .|= AU[:, au_columns] .<< i
end
end
@time compute1(A, AU)
@time compute2(B, AU)
@time compute3(B, AU)
@time compute4(B, AU)
Output:
6.128301 seconds (553.01 k allocations: 30.192 MiB, 2.22% compilation time)
3.640022 seconds (1.97 M allocations: 1.984 GiB, 3.05% gc time, 12.43% compilation time)
2.956211 seconds (1.44 M allocations: 3.842 GiB, 3.73% gc time, 19.24% compilation time)
6.720456 seconds (946.91 k allocations: 3.776 GiB, 2.40% gc time, 4.68% compilation time)
The different methods take between 3 and 6 seconds. Not sure how to improve the performance to at least match Python / Numba
Upvotes: 4
Views: 313
Reputation: 488
For single-threaded conversion into a BitArray, Bool.(AU)
AU .% Bool
(see edit note) should be efficient:
using Random
using BenchmarkTools
AU = rand(Bool, 10_000, 200_000)
@benchmark Bool.($AU)
BenchmarkTools.Trial:
memory estimate: 238.42 MiB
allocs estimate: 4
--------------
minimum time: 658.897 ms (0.00% GC)
median time: 672.948 ms (0.00% GC)
mean time: 676.287 ms (0.86% GC)
maximum time: 710.870 ms (6.57% GC)
--------------
samples: 8
evals/sample: 1
Edit: I just realized Bool.(AU)
won't work well for you because you are converting from an array of 8-bit integers, not an array of Bools, so Bool.(AU)
requires checking that every element of AU
is 0
or 1
. Instead, use AU .% Bool
, which will just take the least bit of each integer and should have the performance shown above.
Upvotes: 2
Reputation: 2838
Why aren't you trying to make the Julia code look like the Python code if you want to compare their performance? So something like:
rowm = 200000 ÷ 8
cols = 10000
rows = rowm * 8
AU = rand(Int8.(0:1), rows, cols)
A = zeros(UInt8, rowm, cols)
function compute!(A, AU)
for i in 1:size(A, 2)
Threads.@threads for j in 1:size(A, 1)
offset = (j-1) * 8 + 1
res = AU[offset , i] << 7
res |= AU[offset+1, i] << 6
res |= AU[offset+2, i] << 5
res |= AU[offset+3, i] << 4
res |= AU[offset+4, i] << 3
res |= AU[offset+5, i] << 2
res |= AU[offset+6, i] << 1
res |= AU[offset+7, i]
A[j, i] = res % UInt8
end
end
end
Note that Julia is column-major so the order of indices should be swapped. And you need to explicitly start julia with multiple threads to have the multi-threading be useful (julia -t8
on Julia 1.6).
Upvotes: 3