George Ogden

Reputation: 835

vectorize vs parallelize in Mojo

According to the docs:

vectorize

vectorize[simd_width: Int, func: fn[Int](Int) capturing -> None](size: Int)

Maps a function which is parametrized over a simd_width over a range from 0 to size in simd fashion.


parallelize

parallelize[func: fn(Int) capturing -> None]()

Executes func(0) … func(N-1) as sub-tasks in parallel and block until completion.

Can someone please share some insights on which to use and when, as well as how and why they both result in such significant speedups?

Upvotes: 3

Views: 1807

Answers (2)

vega

Reputation: 2189

Like @Jarthon, I'm no expert and fairly new to Mojo, but I'll attempt a complementary answer. vectorize and parallelize work totally differently, but they both try to accelerate computation by doing more work simultaneously.

vectorize is all about exploiting CPU ops and registers. For simplicity, let's say your CPU can handle 4 ops per cycle (instruction) and has floating-point registers large enough to hold 8 floats. Say you want to compute the dot product of [1.2 2.2 4.5 5.6] * [2.1 3.3 5.5 4.4]. The number of operations is 2*4-1=7, since you have 4 multiplications and then 3 additions, or 2p-1 FLOPs for vectors of length p. Naive code loops through the vectors, multiplying one pair of elements at a time and accumulating the result. That would be 7 CPU cycles. But the CPU can do 4 ops simultaneously! It can do all the multiplications in a single cycle. Vectorization sends the CPU all the data with a single multiplication instruction applied across all elements, and then a single summation instruction over the resulting vector. Total cycles drop to 2.
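To make that concrete, here's a rough sketch of the multiplication step in Mojo. This is written against the early (2023-era) API that the signatures quoted in the question come from; names like DTypePointer.simd_load/simd_store have changed in later releases, and elementwise_mul is just an illustrative helper name, so treat this as a sketch rather than canonical code.

    from algorithm import vectorize
    from memory import DTypePointer

    alias width = 4  # pretend the CPU multiplies 4 floats per instruction

    # dst[i] = a[i] * b[i] for i in 0..size-1, processed in chunks of `width`.
    fn elementwise_mul(dst: DTypePointer[DType.float32],
                       a: DTypePointer[DType.float32],
                       b: DTypePointer[DType.float32],
                       size: Int):
        @parameter
        fn mul_chunk[w: Int](offset: Int):
            # One SIMD instruction handles `w` pairs starting at `offset`.
            dst.simd_store[w](offset, a.simd_load[w](offset) * b.simd_load[w](offset))

        vectorize[width, mul_chunk](size)

The summation step would then be a single vector reduction over the products, which is where the 2-cycle count above comes from.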

parallelize basically distributes independent tasks over many CPUs. It turns out matrix multiplication is just a bunch of dot products done in sequence. But the dot products don't depend on each other, so let's distribute them to many CPUs!

Example: Given matrices A and B of dimension M x K and K x N respectively, compute C = AB. C is of dimension M x N. Without optimization, each of the M*N elements of C costs 2K-1 cycles (one dot product), for a total of M*N*(2K-1) cycles. For 16 x 16 matrices, that's 16*16*(2*16-1) = 7936 CPU cycles.

What if we use a computer with 16 processors, each with a chip that can do 4 simultaneous ops on our float vectors? The vectorized dot product now takes 16/4 + 16/4 - 1 = 7 cycles instead of 31. And since the 16*16 = 256 dot products are independent, parallelize can hand them out to the 16 processors at once, reducing the time by a factor of 16: 256/16 = 16 rounds of dot products at 7 cycles each, for a total of 7*16 = 112 cycles.

Let's say each sequential operation takes a second. We've gone from M*N*(2K-1) = 7936 seconds to do a matrix multiplication down to 112, which is roughly a 70x speedup.
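Putting the two together, here's roughly the shape of the matmul that Mojo's own example builds up to (sketched against the 2023-era API; the Matrix struct is a convenience wrapper around simd_load/simd_store from that example, and this may not compile against current Mojo):

    from algorithm import vectorize, parallelize
    from memory import DTypePointer

    alias nelts = 4  # SIMD width from the example: 4 floats per instruction

    struct Matrix:
        var data: DTypePointer[DType.float32]
        var rows: Int
        var cols: Int

        fn __init__(inout self, rows: Int, cols: Int):
            self.data = DTypePointer[DType.float32].alloc(rows * cols)
            self.rows = rows
            self.cols = cols

        fn __getitem__(self, y: Int, x: Int) -> Float32:
            return self.load[1](y, x)

        fn load[w: Int](self, y: Int, x: Int) -> SIMD[DType.float32, w]:
            return self.data.simd_load[w](y * self.cols + x)

        fn store[w: Int](self, y: Int, x: Int, val: SIMD[DType.float32, w]):
            self.data.simd_store[w](y * self.cols + x, val)

    fn matmul(C: Matrix, A: Matrix, B: Matrix):
        # Rows of C are independent batches of dot products: one task per row.
        @parameter
        fn calc_row(m: Int):
            for k in range(A.cols):
                # Vectorize the innermost loop across the columns of C.
                @parameter
                fn dot[w: Int](n: Int):
                    C.store[w](m, n, C.load[w](m, n) + A[m, k] * B.load[w](k, n))
                vectorize[nelts, dot](C.cols)

        parallelize[calc_row](C.rows)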

Upvotes: 2

Jarthon

Reputation: 96

While I'm not an expert in this topic, I do have a bit of insight here from deep diving into Mojo over the last few weeks, so I'll give you my understanding of each of these. This ended up being far, far longer than I anticipated; apologies in advance, and if someone wants to give the Cliff's Notes version of my rambling, that might be valuable. Also, this is in some sense more of a systems question with Mojo flavoring than a Mojo-specific question. If a systems person contradicts me somewhere, they're almost certainly right :)


First, vectorize. As you point out in your question, the docs describe this as

Maps a function which is parametrized over a simd_width over a range from 0 to size in simd fashion.

so let's break that down:

We have to give it a parametric function, which is a function with the @parameter decorator. Its parameters (the 'arguments' in square brackets) get evaluated at compile time rather than runtime, which lets the compiler make optimizations: we're guaranteeing those values will be the same every time, so there's no need to keep them flexible and insert them when we call the function.

Then, vectorize is going to take that function, which applies to one element at a time, and make it apply to a SIMD vector. SIMD stands for Single Instruction, Multiple Data, and it means that instead of adding (or multiplying, or whatever) each number one at a time, the computer uses a piece of the chip that does them all (or chunks of them) at the same time, ideally in the same number of clock cycles as it would take to do just one. Most hardware supports this kind of thing nowadays. However, because it's tied to the low-level machine hardware, we can't decide to change how long that vector is on the fly, which is why we require that our function be parametric.
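As a tiny illustration, SIMD is a first-class type in Mojo, and arithmetic on it maps to those hardware instructions (syntax per the 2023-era language version):

    fn main():
        let x = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
        let y = SIMD[DType.float32, 4](10.0, 20.0, 30.0, 40.0)
        # One vector multiply computes all four products at once.
        print(x * y)  # [10.0, 40.0, 90.0, 160.0]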

The reason this speeds things up is pretty simple: if before your CPU had to add each number individually and now it can do them in batches of, say, 4 at a time, then it should be about 4 times faster. There are some (probably many) nuances to this (for example, if you tried to do sets of 11, most hardware wouldn't be able to, so it might do a batch of 8 and another batch of 3 for the leftovers), but that's the general idea.


Next, parallelize. As you say, this is described in the docs as

Executes func(0) … func(N-1) as sub-tasks in parallel and block until completion.

So, in the case of parallelize, we're going to provide some function to it, which takes a number as its argument and returns None. Optionally, we can also provide a number saying how many different tasks we want to run (our function will then be called with inputs 0 through that number minus 1). If we don't specify it, it defaults to the number of physical processors on the current hardware.

Whichever function we gave it then gets run on all of the available processors in, as the name would suggest, parallel. Each instance is independent of the others and differs only in which number the function gets as input. Once a processor finishes its instance of the function, it picks up another (if there are tasks left); if everything is done, it stops and waits. Once all the instances of the function have finished their work, the program can continue running. In cases where you don't want to wait, there's an asynchronous version called async_parallelize, which is a bit more complicated.
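A minimal sketch of that flow, again against the 2023-era API and using the optional task-count overload mentioned above (note the closure still carries @parameter so it can be passed as a capturing function):

    from algorithm import parallelize

    fn main():
        let num_tasks = 8

        @parameter
        fn work(i: Int):
            # Each sub-task gets a distinct i in 0..num_tasks-1 and runs on
            # whichever core is free; completion order is not guaranteed.
            print(i)

        # Blocks here until all sub-tasks have finished.
        parallelize[work](num_tasks)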

The speedup here happens at a higher level. Instead of reducing the number of instructions the computer needs to execute, as with vectorize, this spreads the instructions out over multiple 'workers.' There are only some situations where this is actually helpful, or even possible, chiefly when you're running the same function over and over again with slightly different inputs, but that happens a lot more often than you might think, especially inside for loops.


Now, some context on when and where you might use these to get speedups in your code. This is a pretty complex topic, and many people smarter than me have figured out ways to apply these in non-standard situations, but here are some general rules of thumb.

Vectorization is most obviously useful when you're doing simple operations on long arrays of numbers, but when it's possible it's usually worthwhile. There's a big reason that pretty much every answer on this website about the fastest way to do X with a Python array tells you to use built-in NumPy functions: those are all well vectorized at a low level, and when properly used it's basically free. Setting it up yourself is sometimes a pain, and built-in functions will usually do it for you, but if speed is important and you're working with big arrays of numbers, it should probably be involved somewhere. If you are doing it yourself, however, picking the vector size is very important, and the optimal choice varies from hardware to hardware and program to program. Mojo even has a set of tools to help automatically select which one to use.
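On that hardware-dependence point: rather than hardcoding a width, you can ask the compiler what the machine supports. Something like this worked in the 2023-era standard library (the fancier autotuning tools go further than this):

    from sys.info import simdwidthof

    fn main():
        # Number of float32 lanes that fit in this machine's vector registers.
        alias width = simdwidthof[DType.float32]()
        print(width)  # e.g. 8 with AVX2, 16 with AVX-512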

Parallelization is a bit more complicated. The speedups can be great, and it applies in more varied (and much higher-level/larger) settings, but it isn't 'free' in the same way. There are computational costs associated with setting up the data on all the different processors, keeping track of which processor should do what and which is finished, merging the results at the end in some cases, etc. If you use it poorly, it can and will actually slow down your code. In some programming languages it can also cause some of the most nightmarish bugs known to man (I did my first explicit parallelization in C, and memory errors are bad enough without your code sometimes failing and sometimes not based on an internal coin flip). Luckily this should be much, much better in Mojo, since it's memory safe and designed with parallelization in mind, but still be cautious.

In either of these cases, benchmarking is going to be your best friend. Generally speaking, you should get your code working before trying to speed it up; once you do, converting it and then checking whether the change actually speeds up your program is critical.
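A crude version of that check, using the 2023-era time module (Mojo also ships a benchmark module for this, whose exact API I won't guess at here): time a baseline, apply vectorize or parallelize, and time it again.

    from time import now

    fn main():
        let start = now()  # nanoseconds

        var total: Int = 0
        for i in range(1000000):
            total += i

        let elapsed_ns = now() - start
        print(total)       # use the result so the work isn't optimized away
        print(elapsed_ns)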


Whew. This ended up way too long; sorry to give you so much of a read. I hope some of this answers your question and proves helpful, and I'll try to clarify anything that's confusing if you let me know. If you'd like some more practical examples of the utility of these, Mojo has a matrix multiplication algorithm that they optimize here using both vectorization and parallelization.

Upvotes: 5
