SBCL optimization: Can we compile an efficient population count for bit-vectors?

Question

SBCL, the Lisp implementation I use, knows to compile (logcount x) to the x86-64 POPCNT instruction if x is typeable as a sufficiently short unsigned-byte. Presuming that a simple-bit-vector gets stored by SBCL in a word-aligned continuous memory segment, how might one get the compiler to perform a similarly optimized population count thereon? The standard doesn't provide a (bit-vector-logcount) (an odd omission, in my opinion); nor does it allow me to (coerce) to a fixnum.

An intentionally naïve implementation is as follows; note that there is also a Ω(1)-time one, while the Harley-Seal technique [1] may be a better choice for large vectors. But this is simple enough for an optimizing compiler to spot the intent:

(defun bit-vector-unsigned-logcount (x)
  "Not worrying about negative interpretations of X."
  (declare (type (simple-bit-vector 32) x)
           (optimize (speed 3) (safety 0) (debug 0)))
  (loop for b across x
        count (not (zerop b))))

On SBCL 2.0.1 I get this:

; disassembly for BIT-VECTOR-UNSIGNED-LOGCOUNT
; Size: 67 bytes. Origin: #x52B88079                          ; BIT-VECTOR-UNSIGNED-LOGCOUNT
; 79:       31D2             XOR EDX, EDX
; 7B:       31C0             XOR EAX, EAX
; 7D:       31C9             XOR ECX, ECX
; 7F:       EB2C             JMP L1
; 81:       660F1F840000000000 NOP
; 8A:       660F1F440000     NOP
; 90: L0:   488BD0           MOV RDX, RAX
; 93:       48D1FA           SAR RDX, 1
; 96:       480FA35301       BT QWORD PTR [RBX+1], RDX
; 9B:       19D2             SBB EDX, EDX
; 9D:       83E202           AND EDX, 2
; A0:       4883C002         ADD RAX, 2
; A4:       4885D2           TEST RDX, RDX
; A7:       7404             JEQ L1
; A9:       4883C102         ADD RCX, 2
; AD: L1:   4883F840         CMP RAX, 64
; B1:       7CDD             JL L0
; B3:       488BD1           MOV RDX, RCX
; B6:       488BE5           MOV RSP, RBP
; B9:       F8               CLC
; BA:       5D               POP RBP
; BB:       C3               RET

I will give the SBCL manual a chance to speak.

If your system's performance is suffering because of some construct which could in principle be compiled efficiently, but which the SBCL compiler can't in practice compile efficiently, consider writing a patch to the compiler and submitting it for inclusion in the main sources.

I suspect I am facing such a case, and I'd be pleased to oblige, but I know almost nothing about VOP hacking beyond having looked at the articles by Paul Khuong here and here.

x86-64/arith.lisp defines a couple of VOPs, unsigned-byte-64-count and positive-fixnum-count, which look as though they could be repurposed for the job if we could tease the bit-vector apart.

[1] Muła, W., Kurz, N., & Lemire, D. (2017). Faster Population Counts Using AVX2 Instructions. The Computer Journal, 61(1), 111–120. doi: 10.1093/comjnl/bxx046

SBCL optimization: Can we compile an efficient population count for bit-vectors?

Answers (1)

Related Questions