Finlay Roelofs

Reputation: 570

CPU hashes faster than GPU?

I want to generate a random number and hash it with SHA256 on my GPU using OpenCL, based on this starter code (instead of hashing the pre-given plaintexts, it hashes the random numbers).
I got all the hashing working on my GPU, but there is one problem:
the number of hashes per second actually drops when using OpenCL.

Yes, you read that correctly: at the moment it's faster to use only the CPU than only the GPU.
My GPU runs at only ~10% load while my CPU runs at ~100%.

My question is: how can this be possible and more importantly, how do I fix it?

This is the code I use for generating a pseudo-random number (it is identical between the two runs):

long Miner::Rand() {
    std::mt19937 rng;
    // initialize the random number generator with time-dependent seed
    uint64_t timeSeed = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    std::seed_seq ss{ uint32_t(timeSeed & 0xffffffff), uint32_t(timeSeed >> 32) };
    rng.seed(ss);
    // initialize a uniform distribution between 0 and 1
    std::uniform_real_distribution<double> unif(0, 1);
    double rnd = unif(rng);
    return floor(99999999 * rnd);
}
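(As an editorial aside, not part of the original post: constructing and reseeding a fresh `std::mt19937` on every call is relatively expensive, and for hashing loops a per-thread engine seeded once is the usual pattern. A minimal sketch of that idea, with `RandFast` as a hypothetical replacement name, which also draws integers directly instead of scaling a `double`:)

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// Sketch (not the poster's code): seed one engine per thread once,
// then reuse it on every call.
long RandFast() {
    thread_local std::mt19937 rng = [] {
        uint64_t timeSeed = std::chrono::high_resolution_clock::now()
                                .time_since_epoch().count();
        std::seed_seq ss{ uint32_t(timeSeed & 0xffffffff),
                          uint32_t(timeSeed >> 32) };
        return std::mt19937(ss);
    }();
    // Uniform integers in [0, 99999998], matching floor(99999999 * U[0,1)).
    thread_local std::uniform_int_distribution<long> dist(0, 99999998);
    return dist(rng);
}
```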

Here is the code that calculates the hashrate for me:

void Miner::ticker() {
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    while (true) {
        Sleep(1000);
        HashesPerSecond = hashes;
        hashes = 0;
        PrintInfo();
    }
}

which gets called from here:

void Miner::Start() {
    std::chrono::system_clock::time_point today = std::chrono::system_clock::now();
    startTime = std::chrono::system_clock::to_time_t(today);
    std::thread tickT(&Miner::ticker, this);
    PostHit();
    GetAPIBalance();
    while (true) {
        std::thread t[32]; // max 32 threads; numThreads must not exceed 32
        hashFound = false;
        if (RequestNewBlock()) {
            for (int i = 0; i < numThreads; ++i) {
                t[i] = std::thread(&Miner::JSEMine, this);
            }
            for (auto& th : t)
                if (th.joinable())
                    th.join();
        }
    }
}

which in turn gets called like this:

Miner m(threads);
m.Start();

Upvotes: 0

Views: 7558

Answers (1)

Dragontamer5788

Reputation: 1987

CPUs have far better latency characteristics than GPUs. That is to say, CPUs can do one operation way, way WAAAAYYYY faster than a GPU can. That's not even taking into account the CPU -> Main RAM -> PCIe bus -> GDDR5 "Global" GPU -> GPU Registers -> "Global GPU" -> PCIe bus back -> Main RAM -> CPU round trip time (and I'm skipping a few steps here, like pinning and L1 Cache)

GPUs have better bandwidth characteristics than CPUs (provided that the dataset can fit inside the GPU's limited local memory). GPUs can perform billions of SHA256 hashes far faster than a CPU can.

Bitcoin requires millions, billions, or even trillions of hashes to achieve a competitive hash rate. Furthermore, computations can take place on the GPU without much collaboration with the CPU (removing the need for the slow round-trip through PCIe).

It's an issue of fundamental design. CPUs are designed to minimize latency, while GPUs are designed to maximize bandwidth. It seems like your problem is latency-bound (you're calculating too few SHA256 hashes for the GPU to be effective). 32 is... really really small on the scale we're talking about.

The AMD GCN architecture doesn't even perform at full speed until you have at LEAST 64 work items, and arguably you really need 256 work items to saturate just one of the 44 compute units of, say, an R9 290x.

I guess what I'm trying to say is: try it again with 11264 work items (or more). That's the scale of work item count that GPUs are designed for, not 32. I got this number from 44 compute units on the R9 290x, times 4 vector units per compute unit, times 64 work items per vector unit.
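(Editorial note: the arithmetic above can be sketched as a tiny helper. The figures are the ones quoted in this answer for the R9 290x; `minWorkItems` is a hypothetical name, and the result would be what you'd pass as the global work size to `clEnqueueNDRangeKernel`.)

```cpp
// Minimum global work size needed to keep every vector unit busy,
// computed from the GCN figures quoted in the answer above.
int minWorkItems(int computeUnits, int vectorUnitsPerCU, int workItemsPerWave) {
    return computeUnits * vectorUnitsPerCU * workItemsPerWave;
}
```

For example, `minWorkItems(44, 4, 64)` yields 11264 for an R9 290x.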

Upvotes: 9

Related Questions