Reputation: 188
SBCL, the Lisp implementation I use, knows to compile (logcount x)
to the x86-64 POPCNT
instruction if x
is typeable as a sufficiently short unsigned-byte
. Presuming that a simple-bit-vector
gets stored by SBCL in a word-aligned continuous memory segment, how might one get the compiler to perform a similarly optimized population count thereon? The standard doesn't provide a (bit-vector-logcount)
(an odd omission, in my opinion); nor does it allow me to (coerce)
to a fixnum
.
An intentionally naïve implementation is as follows; note that there is also a Ω(1)-time one, while the Harley-Seal technique [1] may be a better choice for large vectors. But this is simple enough for an optimizing compiler to spot the intent:
(defun bit-vector-unsigned-logcount (x)
"Not worrying about negative interpretations of X."
(declare (type (simple-bit-vector 32) x)
(optimize (speed 3) (safety 0) (debug 0)))
(loop for b across x
count (not (zerop b))))
On SBCL 2.0.1 I get this:
; disassembly for BIT-VECTOR-UNSIGNED-LOGCOUNT
; Size: 67 bytes. Origin: #x52B88079 ; BIT-VECTOR-UNSIGNED-LOGCOUNT
; 79: 31D2 XOR EDX, EDX
; 7B: 31C0 XOR EAX, EAX
; 7D: 31C9 XOR ECX, ECX
; 7F: EB2C JMP L1
; 81: 660F1F840000000000 NOP
; 8A: 660F1F440000 NOP
; 90: L0: 488BD0 MOV RDX, RAX
; 93: 48D1FA SAR RDX, 1
; 96: 480FA35301 BT QWORD PTR [RBX+1], RDX
; 9B: 19D2 SBB EDX, EDX
; 9D: 83E202 AND EDX, 2
; A0: 4883C002 ADD RAX, 2
; A4: 4885D2 TEST RDX, RDX
; A7: 7404 JEQ L1
; A9: 4883C102 ADD RCX, 2
; AD: L1: 4883F840 CMP RAX, 64
; B1: 7CDD JL L0
; B3: 488BD1 MOV RDX, RCX
; B6: 488BE5 MOV RSP, RBP
; B9: F8 CLC
; BA: 5D POP RBP
; BB: C3 RET
I will give the SBCL manual a chance to speak.
If your system's performance is suffering because of some construct which could in principle be compiled efficiently, but which the SBCL compiler can't in practice compile efficiently, consider writing a patch to the compiler and submitting it for inclusion in the main sources.
I suspect I am facing such a case, and I'd be pleased to oblige, but I know almost nothing about VOP hacking beyond having looked at the articles by Paul Khuong here and here.
x86-64/arith.lisp
defines a couple of VOPs, unsigned-byte-64-count
and positive-fixnum-count
, which look as though they could be repurposed for the job if we could tease the bit-vector
apart.
[1] Muła, W., Kurz, N., & Lemire, D. (2017). Faster Population Counts Using AVX2 Instructions. The Computer Journal, 61(1), 111–120. doi: 10.1093/comjnl/bxx046
Upvotes: 4
Views: 432
Reputation: 426
Can sb-kernel:%vector-raw-bits be the missing piece?
(logcount (sb-kernel:%vector-raw-bits #*100101010 0))
-> 4
(disassemble (compile () (lambda (a) (logcount (sb-kernel:%vector-raw-bits a 0)))))
-> (...) POPCNT (...)
How to find: check, e.g., bit-and
, and jump to definition (M-. in sly/slime repl), select the deftransform
for simple-bit-vectors
.
Needless to say this is implementation specific and needs some care to be safe, but so are vops.
Upvotes: 2