Darren Engwirda

Reputation: 7015

Prefetch for Intel Core 2 Duo

Has anyone had experience using prefetch instructions for the Core 2 Duo processor?

I've been using the (standard?) prefetch instructions (prefetchnta, prefetcht1, etc.) with success on a series of P4 machines, but when I run the same code on a Core 2 Duo the prefetcht(i) instructions seem to do nothing, and prefetchnta is less effective.

My criterion for assessing performance is the timing of a BLAS level 1 vector-vector (axpy) operation, with vectors large enough that they no longer fit in cache.

Have Intel introduced new prefetch instructions?

Upvotes: 5

Views: 4442

Answers (3)

camelccc

Reputation: 2992

I tried this once on a tight loop I was optimizing, which loaded 4 doubles and did about 15 floating-point operations per iteration. To get a positive effect on a Core 2 Duo, the prefetch had to target data at least 16 iterations ahead, whereas on older processors 4 iterations ahead was enough.
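To make the "iterations ahead" idea concrete, here is a minimal sketch of an axpy loop with a tunable prefetch distance. It is not the loop described above. The function name `axpy_prefetch` and the `ahead` parameter are hypothetical, and it assumes GCC or Clang, whose `__builtin_prefetch` with locality 0 is typically lowered to `prefetchnta` on x86.

```c
#include <stddef.h>

/* y[i] += a * x[i], four doubles per iteration.
 * `ahead` is a hypothetical tuning parameter: the prefetch distance in
 * elements, e.g. 16 iterations * 4 doubles = 64. */
void axpy_prefetch(size_t n, double a, const double *x, double *y,
                   size_t ahead)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Only hint addresses that lie inside the arrays. */
        if (i + ahead < n) {
            __builtin_prefetch(x + i + ahead, 0, 0); /* read, non-temporal */
            __builtin_prefetch(y + i + ahead, 1, 0); /* write, non-temporal */
        }
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }

    /* Remaining 0..3 elements. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

Timing this with `ahead` set to 16, 64, 128 and so on for an out-of-cache vector should show where, if anywhere, the hint starts to help on a given CPU. With 64-byte lines this issues two hints per line, which is wasteful but keeps the sketch short.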

Upvotes: 1

PhiS

Reputation: 4650

I don't know whether the issue is in your code, but bear in mind that the cache line size (which determines the stride to use with prefetch instructions) can vary between processors. Code tuned for one cache line size is therefore likely to perform worse on a CPU where that assumption doesn't hold.

This question here asked how to determine prefetch cache line size.
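As a rough illustration of querying the line size at run time rather than hard-coding it, here is a minimal sketch. It assumes a POSIX system; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension that may be missing or report 0 elsewhere, so the sketch falls back to 64 bytes, which is itself only an assumption. The function name `cache_line_size` is hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes, or an assumed 64
 * when the platform cannot report it. */
size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (sz > 0)
        return (size_t)sz;
#endif
    return 64; /* assumption: common on recent x86 parts, not guaranteed */
}
```

The prefetch stride in elements would then be `cache_line_size() / sizeof(double)`, so that each line receives one hint.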

Upvotes: 1

Yannick Motton

Reputation: 35991

From an Intel reference document on Intel 64 and IA-32 Architectures, check out pages 163 and 77:

Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture introduced hardware prefetching in addition to software prefetching. The hardware prefetcher operates transparently to fetch data and instruction streams from memory without requiring programmer intervention. Subsequent microarchitectures continue to improve and add features to the hardware prefetching mechanisms. Earlier implementations of hardware prefetching mechanisms focus on prefetching data and instruction from memory to L2; more recent implementations provide additional features to prefetch data from L2 to L1. In Intel NetBurst microarchitecture, the hardware prefetcher can track 8 independent streams.

The Pentium M processor also provides a hardware prefetcher for data. It can track 12 separate streams in the forward direction and 4 streams in the backward direction. The processor’s PREFETCHNTA instruction also fetches 64-bytes into the first-level data cache without polluting the second-level cache.

Intel Core Solo and Intel Core Duo processors provide more advanced hardware prefetchers for data than Pentium M processors. Key differences are summarized in Table 2-10.

Upvotes: 4
