German

Reputation: 345

How to use 128-bit floats and complex numbers in OpenCL/CUDA?

I need to use 128-bit floating-point numbers and complex numbers in parallel GPU computing using OpenCL or CUDA.
Are there any ways to achieve this without implementing them myself?

I looked at the OpenCL and CUDA specifications and found no float128 support there. Is it really impossible to use float128 with them? I also tried to look for libraries, but they do not seem to exist. Is that so?

At the very least I would like to be able to use float128. Is it possible to achieve this?

Upvotes: 1

Views: 449

Answers (3)

aland

Reputation: 5209

For NVIDIA Blackwell (compute capability 10.0) and later devices, CUDA 12.8 introduced support for __float128 / _Float128 in device code and extended the math library with quad-precision functions: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__QUAD.html.

To use these functions, include the header file device_fp128_functions.h in your program.

Note: FP128 device computations require compute capability >= 10.0.

Further details are scarce at the moment (late Jan 2025), but at least nvcc -arch sm_100 is happy with __float128 in kernels.
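
A minimal sketch of what this enables ( assuming CUDA 12.8+, a compute-capability 10.0 device and compilation via nvcc -arch=sm_100; the kernel and variable names below are illustrative, not taken from NVIDIA's docs ):

#include <device_fp128_functions.h>   // quad-precision math library header (CUDA 12.8+)

// Illustrative kernel: plain arithmetic operators on __float128 in device code.
// Requires compute capability >= 10.0 (Blackwell).
__global__ void axpy_fp128(int n, __float128 a,
                           const __float128 *x, __float128 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}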

Upvotes: 2

user3666197

Reputation: 1

I need to use 128-bit floating-point numbers and complex numbers in parallel GPU computing using OpenCL or CUDA. Are there any ways to achieve this without implementing them myself?

No. Not as of 2024-Q3, unless we count FPGAs, SoCs or brave hobbyists' resuscitated Transputer fabrics as other, hardware-parallel workarounds.


This answer is dedicated to WHY it is a bad idea to use FP-number representations in arithmetic methods that decide anything critical.

Many safety-critical industries do not and must not tolerate results degraded by poorly handled ( or entirely unhandled, as in many "fast" CFD "solvers" ) numerical methods and similar cases, where the number representation decides between success and failure ( nice pictures of physical nonsense are not reliable results in life-critical disciplines ).

Pictures may please uneducated eyes, yet if science stays serious about values ( which are inseparable from their respective RANGE of PRINCIPAL uncertainty; not only measured values, even the best-known values of fundamental constants carry a RANGE of PRINCIPAL uncertainty of ~ +/- 1E-8 ), many algorithms soon yield results like N +/- oo !


Knowing "WHY" is infinitely more valuable, than knowing "THAT" :

Prof. Pospisil, advisor to several US presidents, used to claim "Serious science creates understanding ( based on repeatable experiments ). That allows generalisations."

I am trying to draw the Mandelbrot set, and as the zoom increases, double very quickly runs out of precision; I need more digits after the point.

[ Image: deep zoom into the Mandelbrot set. By Claude Heiland-Allen - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=47022590 ]

Facts first :
(a)
The Mandelbrot set is not a "picture" as we know it; it is an infinitely complex mathematical object with a surprisingly trivial, iterative formulation. Problems ( as you noted ) start once we try to create some visual ( simplified ) representation of it: the deeper you try to "dive", the more the hardware-related number-representation constraints get in the way of computing still-meaningful values.

(b)
As Kahn proved in 2001, the Mandelbrot set is connected (!) in a topological sense, so attempts to "rasterise" its simplified visual presentation onto a 2D grid create more and more problems at finer and finer levels of detail ( further complicating the need for a reasonable number representation, so the problems already described in (a) grow even bigger ).

(c)
As the Mandelbrot set is an abstract object, it has no physical-world constraints ( such as passing a Planck length or headbanging into the principal boundary of Heisenberg's uncertainty principle ), so its huge, in principle infinite, depths of detail cannot be avoided.

Good news :

Simplicity is on your side: using only the Mandelbrot set membership-detecting iterator, which tests the iterator-produced value against a preset "give-up" threshold, you need nothing more complicated than ADD()-, SUB()- and MUL()- algebraic methods, implemented for whatever number representation you choose. ( At this point you should have already realised that, just as float64 soon degrades, so will float128; a custom number representation is the only way forward to "deeper" zooms when graphing 2D sections of the Mandelbrot set. )

No matter which custom-defined number representation you opt in for, the membership-detecting iterator uses just these simple calculations, derived from the standard notation of the iterative generator
( z_{n+1} = z_n^2 + c ); note the temporary z_RE_new below, which keeps z_IM computed from the old z_RE :

z_RE_new := c_RE + ( z_RE + z_IM ) * ( z_RE - z_IM )
z_IM     := c_IM + ( z_RE * z_IM * 2 )
z_RE     := z_RE_new

and test, if :

( z_RE * z_RE + z_IM * z_IM ) > a_giveup_threshold_SquaredCONST

So we need only :

8 registers ( instances of ( almost ) infinite precision storage - only [SPACE] is our limit here ), and
4 operations ( ADD, SUB, MUL, > ), composable from 64-bit, 32-bit or even 16-bit hardware uops on smartly used bitfields ( a minimal sketch of the iterator follows below ).
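
To make the iterator's structure concrete, here is a minimal escape-time sketch in plain CUDA double precision ( all names are illustrative; double is exactly the format that degrades on deep zooms, so read it as a structural template whose double would be swapped for your custom representation ):

// Minimal escape-time iterator; double precision = structural template only.
__global__ void mandel_iter(const double *c_RE, const double *c_IM,
                            int *iters, int n, int max_iter,
                            double giveup_threshold_sq /* e.g. 4.0 */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double z_RE = 0.0, z_IM = 0.0;
    int k = 0;
    while (k < max_iter
           && z_RE * z_RE + z_IM * z_IM <= giveup_threshold_sq) {
        double z_RE_new = c_RE[i] + (z_RE + z_IM) * (z_RE - z_IM);
        z_IM = c_IM[i] + (z_RE * z_IM * 2.0);   // uses the OLD z_RE
        z_RE = z_RE_new;
        ++k;
    }
    iters[i] = k;   // per-thread "mileage" varies a lot: this is the
}                   // warp-divergence problem discussed further below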

For doing this, first-year high-school ( Lyceum, secondary-school ) mathematics, where polynomials were taught up to polynomial-by-polynomial division ( which we do not even need here ), is all you need to create a trivial or a more sophisticated ( variable-length ) number representation that can be processed in parallel, smartly using the SIMD instructions of your choice ( even on a GPU ).
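
As a hedged illustration of composing wider arithmetic from 64-bit hardware operations, one possible fixed-point sketch ( the fx128 type and the function names are hypothetical, invented for this example only ):

#include <cstdint>

// Hypothetical 128-bit fixed-point value made of two 64-bit limbs:
// 'hi' holds sign + integer part, 'lo' holds the fraction.
struct fx128 { uint64_t hi, lo; };

__host__ __device__ inline fx128 fx_add(fx128 a, fx128 b)
{
    fx128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   // carry out of the low limb
    return r;
}

__host__ __device__ inline fx128 fx_sub(fx128 a, fx128 b)
{
    fx128 r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);   // borrow from the high limb
    return r;
}

// MUL needs the four 64x64 -> 128-bit partial products; in CUDA device
// code the high halves come from the __umul64hi() intrinsic, recombined
// by the same shift-and-add scheme taught for polynomial multiplication.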

Bad news :

Using good tools for the things they were invented for usually helps us get results in that problem domain. Using a tool that is good for something other than what we need to solve does not necessarily make our problem any easier.

This is the case with the O/P's statement :

"(...) use 128 bit floating point numbers and complex numbers in parallel GPU computing using OpenCL or CUDA."

GPU SM engines, CUDA and OpenCL were designed to excel at fast execution of trivial mathematical kernels, doing transformations over things like RGB 8-bit colour planes ( texturing, local smoothing et al. ).

If we were to type this :

Maslow's hammer ( in Arabic ): "When the only tool you have is a hammer, suddenly everything starts to look like a nail..."

using a Bulgarian, Sanskrit or similar keyboard, we would fail as miserably as we will when trying to "enforce" a GPU to work with Complex128 or Float128 values as efficiently as it works with 4-bit ( in inference TPUs ) or similarly fixed-bit-depth-optimised on-board SM + memory hardware + programming-language transformers.

All that does not mean you cannot get ( almost ) infinitely deep views into Mandelbrot set sections.

You will just have to design smart code for the { ADD, SUB, MUL, > } operators of your new, custom-defined number representation, one that will not degenerate on deeper zooms.

All this is doable, using fixed- or even variable-length number representations, and it can even be ported onto COTS GPU fabrics. Be aware, though, that since you will most probably still resort to some kind of discrete coverage of points inside a zoomed-in 2D section, the GPU will not boost performance as much as you might expect. Thread coherency will, by definition, soon force your warp-wide threads out of "Coherent-scheduler" mode, and threads will in principle finish in divergent mode ( as their iterator "mileage" varies a lot ). The GPU SM scheduler will then fall back into the so-called "Greedy mode", where hardware optimised for latency masking performs far worse than in "Coherent-mode" warp-wide SIMD number-crunching ( Amdahl's Law and GPU latency-masking details matter, a lot ).

Bonus part :

My first Mandelbrot implementation and naive number-representation experiments started some 40 years ago, on 8-bit machines: the Commodore C-128 ( 128 kB of RAM was luxurious space in those days (!) ) and Sir Clive Sinclair's ZX Spectrum ( 48 kB RAM ), both using a PAL TV as the display.

Today, using the syntax of :

ffplay -f  lavfi        \
       -fs -i mandelbrot \
       -vf "format=gbrp,split=4[a][b][c][d],[d]histogram=display_mode=0:level_height=244[dd],[a]waveform=m=1:d=0:r=0:c=7[aa],[b]waveform=m=0:d=0:r=0:c=7[bb],[c][aa]vstack[V],[bb][dd]vstack[V2],[V][V2]hstack"

we can do a lot more and reveal new levels of detail and demonstrate further problems :

[ Image: ffplay output - the mandelbrot source with stacked waveform and histogram views ]

  • How would one defend slowing down the computation process?
  • How did the mathematically imprecise implementation "end"?
  • How will a mathematically precise implementation "end", and why?
  • What arguments support the claim that the structures observed in the orthogonal histogram views are "properties" of the projected, connected ( as noted above ) topological structure, rather than artifacts emerging from the weaknesses of the chosen number representation?

Upvotes: -3

ProjectPhysX

Reputation: 5754

No modern GPU or CPU supports FP128 in hardware. GPUs only have circuitry for FP32, with very limited to no support for FP64. Neither OpenCL nor CUDA supports FP128.

You have to implement the format yourself, with conversion and arithmetic emulated on a struct of two 64-bit integers. The same goes for complex numbers.

I have super-fast FP16<->FP32 conversion algorithms here; they are adaptable to FP64<->FP128. For the arithmetic, you have to find your own solution.
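
One common emulation route, sketched here as an assumption ( double-double arithmetic: a pair of FP64 values giving roughly 106 mantissa bits, i.e. less than a true IEEE binary128; an alternative to the integer-struct approach above, not this answer's code ):

// "Double-double": represent a value as the unevaluated sum hi + lo of
// two FP64 numbers. Knuth's two-sum recovers the exact rounding error.
struct dd { double hi, lo; };

__host__ __device__ inline dd dd_add(dd a, dd b)
{
    double s  = a.hi + b.hi;
    double t  = s - a.hi;
    double e  = (a.hi - (s - t)) + (b.hi - t);   // exact error of s
    e += a.lo + b.lo;                            // fold in the low words
    double hi = s + e;                           // renormalise the pair
    return dd{ hi, e - (hi - s) };
}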

What do you even need 34 decimal digits for? Is there a way to get the same done with FP64? Quite often you can use numeric trickery to avoid digit extinction ( catastrophic cancellation ) in lower-precision formats.
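
A generic textbook illustration of such trickery ( not from this answer ): for small x, 1 - cos(x) subtracts two nearly equal numbers and loses most digits, while an algebraically identical form stays accurate in plain FP64:

#include <math.h>

__host__ __device__ double one_minus_cos_naive(double x)
{
    return 1.0 - cos(x);    // catastrophic cancellation near x = 0
}

__host__ __device__ double one_minus_cos_stable(double x)
{
    double s = sin(0.5 * x);
    return 2.0 * s * s;     // same value, full FP64 accuracy, no FP128 needed
}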

Also, have a look at Posit formats. If your application only does arithmetic close to the number 1, 64-bit Posit is way more capable than FP64 and could be good enough.

Upvotes: 5
