Asain Kujovic

Reputation: 1829

x264 library speed - Altivec vs SSE4

I have a simple, cheap dual-core 3 GHz Intel machine running Debian, and access to a super-expensive PowerPC 7 machine running AIX.

After a few days of struggle, I compiled libx264 and tested it on both computers:

  1. GCC-built x264 library on the Intel machine (with SSE2 capabilities), and
  2. GCC-built x264 library on the 16-core PowerPC (with AltiVec).

... and the result is that the cheap Intel machine is 2x faster! (With AltiVec disabled, the Intel machine is 10x faster.)

My question: is this normal? Do all other PowerPC users get the same results? Can the PowerPC AltiVec optimisation of the x264 library run at the same speed as on Intel, or is the MMX/SSE optimisation officially at least 2 times faster for this library?

I am not interested in multi-threading options. The number of cores and threads is irrelevant. Just simple single-threaded x264 encoding with the default "medium" preset, using raw video as the source: SSE vs AltiVec.

Maybe the native AIX XLC compiler would give better results? (I only managed to get GCC to work.)

... maybe Mac PowerPC users know something about this.

powrPc7-Aix:$ time (cat raw10sec.y4m |x264 --input-res 720x576 --fps 50 -o /dev/null -)
x264: 64-bit XCOFF
x264 [info]: using cpu capabilities: Altivec
time: real 0m33.559s
---
intelDebian:$ time (cat raw10sec.y4m |x264 --input-res 720x576 --fps 50 -o /dev/null -)
x264: ELF 32-bit LSB executable
x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 FastShuffle SSE4.1 Cache64
time: real 0m16.503s
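
As a rough check on the "2x" figure, the two wall-clock times above can be turned into a speedup and a frames-per-second rate. This is only a sketch: it assumes the clip is exactly 10 seconds at 50 fps (500 frames), which the file name and the --fps option suggest but the post does not state, and the `time` figures also include the `cat` and pipe overhead.

```python
# Wall-clock times copied from the two runs above (seconds).
ppc_real = 33.559
intel_real = 16.503

# Assumption: raw10sec.y4m holds 10 s of video at 50 fps, i.e. 500 frames.
frames = 10 * 50

# Ratio of elapsed times: how many times faster the Intel run finished.
speedup = ppc_real / intel_real

# Encoding throughput on each machine.
ppc_fps = frames / ppc_real
intel_fps = frames / intel_real

print(f"Intel is {speedup:.2f}x faster")
print(f"PowerPC: {ppc_fps:.1f} fps, Intel: {intel_fps:.1f} fps")
```

Under that frame-count assumption, only the Intel run reaches anything like a real-time rate for 50 fps material.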

Upvotes: 2

Views: 2381

Answers (1)

tc.

Reputation: 33592

A few things spring to mind:

  • GCC has likely had much more effort put into optimizing x86 (specifically commodity Intel/AMD parts) than other architectures, possibly all other architectures combined.
  • x264 may similarly have had more effort put into optimizing x86/SSE.
  • Your question says SSE2, but x264 says it's using SSE4.1. There's a big difference there!
  • MMX/SSE was initially targeted towards things Intel thought mattered, with many specialized instructions and quirks (e.g. there are different instructions for floating point and integer loads, despite the fact that they load the same memory into the "same" register). AltiVec seems much more orthogonal, but as a result, may be less good at the things that MMX was designed to be good at.
  • Even assuming that AltiVec/SSE are largely equivalent, you haven't mentioned clock speed and instructions-per-clock.
  • The PPC is partly expensive because you're paying for 16×4 threads — it's not uncommon to want to pack as much as possible onto a single chip for server/HPC applications. It's slightly embarrassing that a collection of commodity parts is often faster and cheaper (sometimes even accounting for lifetime electricity costs), but that's the way things are headed.
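
To make the clock-speed point concrete, elapsed time can be converted into approximate cycles per frame. The following is a rough sketch, not a measurement: it assumes a 500-frame clip (10 s at 50 fps), a single-threaded run at a fixed clock, the 3 GHz figure from the question for the Intel machine, and a purely hypothetical 3.5 GHz for the PowerPC, whose clock is not given. The function name `cycles_per_frame` is made up for illustration.

```python
def cycles_per_frame(clock_hz: float, seconds: float, frames: int) -> float:
    """Approximate core cycles elapsed per encoded frame (one thread, fixed clock)."""
    return clock_hz * seconds / frames


FRAMES = 500  # assumption: 10 s of video at 50 fps

# Intel: 3 GHz is stated in the question; 16.503 s is the measured wall time.
intel = cycles_per_frame(3.0e9, 16.503, FRAMES)

# PowerPC: 33.559 s is measured; 3.5 GHz is a HYPOTHETICAL placeholder clock.
ppc = cycles_per_frame(3.5e9, 33.559, FRAMES)

print(f"Intel:   {intel / 1e6:.0f} Mcycles/frame")
print(f"PowerPC: {ppc / 1e6:.0f} Mcycles/frame")
print(f"Per-clock ratio (PowerPC / Intel): {ppc / intel:.2f}")
```

If the per-clock ratio stays near or above 2 once the real clock is plugged in, the gap lies in work done per cycle (compiler output, hand-written SIMD, instructions-per-clock) rather than in clock speed.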

A more interesting comparison would be against a PS3 with code optimized to take advantage of all cores — apparently PS3s are great at bruteforcing crypto. Sadly they've stopped making them, and I don't know how easy it is to run Linux on one these days.

Upvotes: 1
