user34537

Why is struct{u64} faster than struct{u32,u32}?

Huh, ideone shows the opposite (B slower). With g++ 4.9.3 in my VM (Ubuntu Werewolf) I get A: 830, B: 460. Either way, why is one faster than the other? I assumed A was being passed by reference and B by value, but I don't see why the compiler would do that. I tried using operator<(A a, ... and that didn't help.

#include <cstdio>
#include <deque>
#include <algorithm>
#include <climits>
#include <sys/time.h>
using namespace std;

typedef unsigned int u32;
typedef unsigned long long u64;

struct A {
    u32 first, second;
    A(int f, int s):first(f),second(s) {}
    bool operator<(u32 b) const { return first < b; }
};
//bool operator<(const A&a, u32 b) { return a.first < b; }

struct B {
    u64 v;
    B(int f, int s) { v = ((u64)f << 32) | (u32)s; }
    bool operator<(u32 b) const {
        return (v >> 32) < b;
    }
};
u32 get_value(deque<A>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->second;
    else
        return UINT_MAX;
}
u32 get_value(deque<B>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->v & 0xFFFFFFFF;
    else
        return UINT_MAX;
}

int main(int argc, char *argv[]) {

    {
        deque<A> d;
        struct timeval s, e;
        gettimeofday(&s, 0);
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(A(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        gettimeofday(&e, 0);
        auto sec = e.tv_sec - s.tv_sec;
        auto usc = e.tv_usec - s.tv_usec;
        printf("A %ld\n", v);
        printf("Time: %lu\n", sec * 1000 + usc / 1000);
    }
    {
        deque<B> d;
        struct timeval s, e;
        gettimeofday(&s, 0);
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(B(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        gettimeofday(&e, 0);
        auto sec = e.tv_sec - s.tv_sec;
        auto usc = e.tv_usec - s.tv_usec;
        printf("A %ld\n", v);
        printf("Time: %lu\n", sec * 1000 + usc / 1000);
    }
}

Upvotes: 2

Views: 543

Answers (3)

Peter Cordes

Reputation: 365707

I think you're getting a speedup with struct B { u64 v; } because you're using it with compile-time-constant args. The compiler can combine both values into a 64bit store.
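
If you want to check that part yourself, a pair of tiny wrapper functions (the names are mine, not from the question) fed to the compiler explorer next to the struct definitions will show whether A's two u32s get stored with two 4-byte stores or merged into a single 8-byte store like B's:

// Hypothetical wrappers for inspecting the generated stores; not part of the question's code.
void store_A(A *out, int i) { *out = A(i, i ^ 92142); }  // two u32 fields to store
void store_B(B *out, int i) { *out = B(i, i ^ 92142); }  // one u64 value to store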

I made a simple non-member function to get asm output for the non-inline case of operator< for A and B. As Godbolt shows, the asm for struct A::operator< is significantly more efficient; B::operator< really does do a 64bit load, then a shift. I commented the asm for the benefit of people who don't know x86 very well.

bool comp_A(const struct A &lhs, u32 rhs) { return lhs < rhs; }
bool comp_B(const struct B &lhs, u32 rhs) { return lhs < rhs; }

comp_A(A const&, unsigned int):
    cmpl    %esi, (%rdi)  # compare with lhs->first as a memory operand
    setb    %al
    ret
comp_B(B const&, unsigned int):
    movq    (%rdi), %rax  # load lhs->v
    movl    %esi, %esi    # zero-extend the low 32b of rhs
    shrq    $32, %rax     # shift the high32 of v down to the low 32
    cmpq    %rsi, %rax
    setb    %al           # al=1 if rax is below rsi (unsigned <), else 0.
    ret

x86 has fast hardware support for loading any size integer from RAM into a 32 or 64bit temporary (in a register). movzx and movsx zero or sign extend, and are as cheap as regular loads with mov. 32bit ops always zero the high 32 of the dest register, so false dependencies on old values aren't a problem. (They are for 16 and 8bit ops, which is why movzx to a 32bit temporary is a good plan.)
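
As a C++-level illustration of that (these two throwaway functions are mine, not from the answer), widening a u32 to a u64, or loading a u32 field into a 64-bit temporary, typically compiles to nothing more than a single 32-bit mov whose upper half is zeroed for free:

// Hypothetical helpers to look at on the compiler explorer; not from the question's code.
u64 widen(u32 x)           { return x; }        // typically just: mov %edi,%eax / ret
u64 load_first(const A &a) { return a.first; }  // typically just: mov (%rdi),%eax / ret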

I haven't looked at the asm of your actual functions, as it's quite large. I'd suggest using version A in general, though. It may be that gcc isn't copying it around with 64bit moves, maybe because it might not be aligned? x86 doesn't require alignment (except for non-AVX (legacy-SSE) 16byte memory operands). However, IIRC it's only since about 2008 or 2009 CPUs (Intel's Nehalem) that unaligned loads/stores had no penalty as long as they didn't cross a cache line. gcc may still be reluctant to use them, since the potential gain is smaller than the potential downside on old CPUs with slow unaligned access.


You might get gcc to give you the best of both worlds with a union.

union A { u64 v; u32 first_last[2]; };

This might induce the compiler to use 64bit moves when copying it around, but still do 32bit loads of A.first_last[0] and not need to shift when you're accessing individual fields.
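
A rough sketch of how that could look here (names and layout are my own, not from the answer; reading the union member you didn't most recently write is formally undefined in ISO C++, though GCC documents it as working):

// Sketch only, reusing the question's u32/u64 typedefs. The key lives in
// first_last[0] and the value in first_last[1], so comparisons are plain
// 32-bit loads with no shift, while the u64 member gives the type 8-byte
// size and alignment and nudges the compiler toward 64-bit copies.
struct AB {
    union {
        u64 v;
        u32 first_last[2];
    };
    AB(u32 f, u32 s) { first_last[0] = f; first_last[1] = s; }
    bool operator<(u32 b) const { return first_last[0] < b; }
    u32 second() const { return first_last[1]; }
};

Unlike struct B, no field ends up in the high half of v, so nothing needs shifting; lower_bound and the field accesses only ever touch the 32-bit halves.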

Upvotes: 2

P. Hinker

Reputation: 297

Getting accurate high-resolution timings is very hard. The gettimeofday function is not guaranteed to have microsecond resolution and can be affected by many other things going on in the system (ntpd, other threads, etc.). One thing that can get you closer is a tick-timer, if you're on a machine that lets you read the cycle counter (like Intel processors).

Using your code and a tick-timer I noted that the fastest time moved from one structure to the other (I'm not running on a VM).

#include <cstdio>
#include <deque>
#include <algorithm>
#include <climits>
#include <sys/time.h>
#include <stdint.h>
using namespace std;

typedef unsigned int u32;
typedef unsigned long long u64;
static_assert(sizeof(u32)==4 && sizeof(u64)==8, "Fail");

#define CLOCK_TICKS 2400000000  // my machine clock is 2.4 GHz

struct A {
    u32 first, second;
    A(int f, int s):first(f),second(s) {}
    bool operator<(u32 b) const { return first < b; }
};
//bool operator<(const A&a, u32 b) { return a.first < b; }

struct B {
    u64 v;
    B(int f, int s) { v = ((u64)f << 32) | s; }
    bool operator<(u32 b) const {
        return (v >> 32) < b;
    }
};
u32 get_value(deque<A>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->second;
    else
        return UINT_MAX;
}
u32 get_value(deque<B>& d, u32 v) {
    auto p = lower_bound(d.begin(), d.end(), v);
    if (p != d.end())
        return p->v & 0xFFFFFFFF;
    else
        return UINT_MAX;
}

// This function was originally published in the Intel hardware manual
// for the first chips that supported the SSE instruction set.  I've
// seen variations on this for 12+ years and do not know where to
// attribute its origin

inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ (
      "xorl %%eax, %%eax\n"
      "cpuid\n"
      "rdtsc\n"
      : "=a" (lo), "=d" (hi)
      :
      : "%ebx", "%ecx");
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
    uint64_t s, e;
    double elapsed;
    s = rdtsc();   // warm up the timer
    {
        deque<A> d;

        s = rdtsc();
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(A(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        e = rdtsc();
        elapsed = ((double)e - (double)s) / (double)CLOCK_TICKS;
        printf("A %ld\n", v);
        printf("Time: %lf\n", elapsed);
    }
    {
        deque<B> d;
        s = rdtsc();
        for (int i = 0; i < 1024LL * 1024 * 1024 * 3 / 32; ++i)
            d.emplace_back(B(i, i ^ 92142));
        long v = 0;
        for (int i = 0; i < 10000; ++i)
            v += get_value(d, i * 3 + 1);

        e = rdtsc();

        printf("B %ld\n", v);
        elapsed = ((double)e - (double)s) / (double)CLOCK_TICKS;
        printf("Time: %lf\n", elapsed);
    }   
}

Upvotes: 0

Blindy

Reputation: 67477

Working with register-sized variables is always fastest; otherwise you have to deal with potentially unaligned data and partial-width operations.

I think it's a safe assumption that ideone is using a 64-bit compiler, so that explains that.

Upvotes: 0
