Will

Reputation: 75645

Prefetch instructions on ARM

Newer ARM processors include the PLD and PLI instructions.

I'm writing tight inner loops (in C++) with a non-sequential memory access pattern, but one that my code knows fully in advance. I would anticipate a substantial speedup if I could prefetch the next location while processing the current one, and this should be quick enough to try that the experiment is worth it!

I'm using new, expensive compilers from ARM, and they don't seem to be emitting PLD instructions anywhere, let alone in the particular loop that I care about.

How can I include explicit prefetch instructions in my C++ code?

Upvotes: 4

Views: 7063

Answers (5)

Ali Rehman

Reputation: 1

I don't think you can do prefetching or cache management in C++ code itself, but you can include a prefetch in assembly code. There is an instruction named PRFM in the ARMv8 and later instruction set. You also need to take care of address translation if it's a virtual address.
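As a rough illustration of the assembly route, here is a minimal sketch that wraps the prefetch instruction in a small function. It assumes a compiler that accepts GNU-style inline assembly (GCC, Clang); `prefetch_read` is a hypothetical helper name, and the `pldl1keep` hint is just one reasonable choice. On non-ARM targets it compiles to nothing.

```cpp
// Hypothetical helper: issue a data prefetch for the cache line holding *p.
// Assumes GNU-style inline assembly; other compilers use different syntax.
inline void prefetch_read(const void* p) {
#if defined(__GNUC__) && defined(__aarch64__)
    // AArch64: PRFM, prefetch for load into L1, keep in cache.
    __asm__ volatile("prfm pldl1keep, [%0]" : : "r"(p));
#elif defined(__GNUC__) && defined(__arm__)
    // AArch32: PLD. The target architecture must actually support it,
    // otherwise the assembler rejects the instruction.
    __asm__ volatile("pld [%0]" : : "r"(p));
#else
    (void)p;  // No prefetch on other targets.
#endif
}
```

A prefetch is only a hint, so calling `prefetch_read(&table[next])` while working on `table[current]` should not change program results, only timing.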

Upvotes: 0

Matt J

Reputation: 45203

It is not outside the realm of possibility that other optimizations like software pipelining and loop unrolling may achieve the same effect as your prefetching idea (hiding the latency of the loads by overlapping it with useful computation), but without the extra instruction-cache pressure caused by the extra instructions. I would even go so far as to say that this is the case more often than not, for tight inner loops that tend to have few instructions and little control flow. Is your compiler doing these types of traditional optimizations instead? If so, it may be worth looking at the pipeline diagram to develop a more detailed cost model of how your processor works, and to evaluate more quantitatively whether prefetching would help.

Upvotes: 0

Dan

Reputation: 10393

At the risk of asking the obvious: have you verified the compiler's target architecture? For example (humor me), if by default the compiler is targeted to ARM7, you're never going to see the PLD instruction.
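One quick way to check is to look at the architecture macro the compiler predefines. The sketch below assumes the ACLE macro `__ARM_ARCH`, which GCC and Clang define; other toolchains may use a different macro, so treat this as an assumption and check your compiler's documentation. `target_arm_arch` is a hypothetical helper name.

```cpp
// Hypothetical helper: report the ARM architecture version the compiler
// is targeting, or 0 if __ARM_ARCH is not defined (non-ARM target, or a
// toolchain that uses a different macro).
constexpr int target_arm_arch() {
#if defined(__ARM_ARCH)
    return __ARM_ARCH;
#else
    return 0;
#endif
}
```

Printing this value once, or inspecting the target options on the compiler command line, shows whether the build is aimed at an architecture that has PLD at all.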

Upvotes: 0

Loren Charnley

Reputation: 233

If you are trying to extract truly maximum performance from these loops, then I would recommend writing the entire looping construct in assembler. You should be able to use inline assembly, depending on the data structures involved in your loop. Even better if you can unroll any piece of your loop (like the parts that make the access non-sequential).

Upvotes: 1

Ionut Anghelcovici

Reputation:

There should be some compiler-specific features for this. There is no standard way to do it in C/C++. Check your compiler's reference guide. For the RealView compiler see this or this.
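As an example of such a compiler-specific feature, GCC and Clang provide `__builtin_prefetch`. The sketch below is only an illustration under that assumption: `PREFETCH` and `sum_indexed` are hypothetical names, and the index-array loop stands in for whatever non-sequential pattern your code has. Other compilers have their own intrinsics, so the macro falls back to a no-op there.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper macro. __builtin_prefetch(addr, rw, locality):
// rw = 0 means prefetch for reading, locality = 3 means keep in all caches.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH(addr) ((void)0)
#endif

// Sum data[] in the order given by order[], prefetching the element
// needed in the next iteration while processing the current one.
std::uint64_t sum_indexed(const std::vector<std::uint32_t>& data,
                          const std::vector<std::size_t>& order) {
    std::uint64_t total = 0;
    const std::size_t n = order.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 1 < n) {
            PREFETCH(&data[order[i + 1]]);  // hint only, no effect on results
        }
        total += data[order[i]];
    }
    return total;
}
```

Prefetching just one iteration ahead may be too late to hide the full memory latency; the right distance depends on the work per iteration and is worth measuring.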

Upvotes: 5
