Marcus Wu

Reputation: 47

Why does Xeon Phi always show poor performance?

I ran a for loop 1,000,000,000 times on a Xeon E5 and on a Xeon Phi and measured the time to compare their performance. I was surprised by the following result:

Can anybody tell me why the performance is so poor? Is it the architecture or something else?

Why is the performance on Xeon Phi so poor? I do nothing inside the for loop. If there is nothing wrong with my Xeon Phi coprocessor, what kind of work is Xeon Phi good at? Does it have to be vectorized? If not, can I still use its threads to do something useful?

Upvotes: 1

Views: 1357

Answers (3)

Vahid Noormofidi

Reputation: 818

First, you have to utilize the entire chip, i.e., the SIMD units as well. Second, to get good performance from the Xeon Phi, the pipeline must not sit idle, i.e., there always have to be enough instructions in the pipeline. Your benchmark issues no instructions, so you basically measured the launch of an empty loop (which your compiler has likely optimized out), and because of the CPU's higher clock speed, it runs faster on the CPU.
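To make the point about the empty loop concrete, here is a minimal sketch in C. The function names (`empty_loop`, `sum_of_squares`, `cpu_seconds`) are made up for illustration and are not from the question. The idea is that a loop with no observable effect can legally be deleted by an optimizing compiler, while a loop whose result is returned has to be computed.

```c
#include <stdint.h>
#include <time.h>

/* Empty loop: it has no observable effect, so with optimization enabled
   (e.g. -O2) the compiler is free to delete it. Timing it then measures
   little more than call and timer overhead. */
void empty_loop(uint64_t n) {
    for (uint64_t i = 0; i < n; ++i) {
        /* nothing */
    }
}

/* Loop with a result that escapes: the sum is returned to the caller,
   so the compiler must produce it (it may still vectorize the loop). */
double sum_of_squares(const float *x, uint64_t n) {
    double acc = 0.0;
    for (uint64_t i = 0; i < n; ++i) {
        acc += (double)x[i] * (double)x[i];
    }
    return acc;
}

/* CPU time used by the process so far, in seconds (standard C clock()). */
double cpu_seconds(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}
```

Calling `cpu_seconds()` before and after `sum_of_squares(x, n)` and using the returned sum (for example, printing it) gives a measurement of real work; doing the same around `empty_loop(n)` may report close to zero regardless of `n`.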

In addition, in my benchmarks I found that the Xeon Phi's performance is very sensitive to the length of the innermost loop (that runs on SIMD units).

Upvotes: 1

Computer architect

Reputation: 49

Xeon Phi sucks. In moderately parallel applications, traditional Xeons trounce Xeon Phi; in massively parallel applications, GPGPUs rule. Xeon Phi is only marginally competitive when you can perfectly parallelize AND vectorize your application. If either one is not perfect, forget Xeon Phi.

EDIT: Some examples where Xeon Phi performs worse than either traditional Xeons or GPGPUs:

blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/

http://www.delaat.net/awards/2014-03-26-paper.pdf

https://verc.enes.org/ISENES2/documents/Talks/WS3HH/session-4-hpc-software-challenges-solutions-for-the-climate-community/markus-rampp-mic-experiences-at-mpg

Upvotes: 2

Taylor Kidd

Reputation: 1511

The key is that you say, "I do nothing in the for loop." (Please correct me if I'm mistaken.)

Because of practical limits when the Xeon Phi was created, its cores are based on a Pentium-generation design with various enhancements, such as dual issue, 4 threads per core, and the 512-bit vector engine. So if you are only running scalar code, it runs like a Pentium.

You need to run code that is both highly parallel and highly vectorizable. Even better if threads running on each core are able to share the core's pipeline without much contention, e.g. DGEMM, as well as take advantage of the cache structure.
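As a rough illustration of "both highly parallel and highly vectorizable", here is a sketch using a SAXPY-style loop rather than DGEMM, simply because it is shorter. The function name `saxpy` and the use of OpenMP are my assumptions, not something the question specifies. The pragma takes effect only if OpenMP is enabled at compile time (e.g. `-fopenmp` with GCC); otherwise it is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
   Every iteration is independent of the others, so the iteration space
   can be split across threads ("parallel for") and each thread's chunk
   can be executed with vector instructions ("simd"). */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With `n` in the millions this kind of loop gives every core and every vector lane something to do, which is the opposite of the empty loop in the question.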

By running a trivial benchmark, you are basically comparing the execution of overhead code on both architectures (Xeon and Xeon Phi). And overhead code is typically scalar.

Here's an exaggerated illustration for the more visually inclined among us.

|<--Ovr-->|<--Work--------------->| repeat 10^6 times //Xeon Server

|<-----Ovr----->|<-Work->| repeat 10^6 times //Xeon Phi

Where "Ovr" is overhead, and "Work" is your highly threaded and vectorized workload.

If you have "Work" to do, then the Xeon Phi does better. If you remove the "Work", leaving only the overhead, the Xeon does better.
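The same argument can be written as a toy cost model. The sketch below is only an illustration: the type `cost_t`, the function `iteration_time`, and any numbers you plug in are hypothetical, in arbitrary time units, and not measurements of real hardware.

```c
/* Per-iteration cost split into scalar overhead and the threaded,
   vectorized work, in arbitrary time units. */
typedef struct {
    double overhead;
    double work;
} cost_t;

/* Time for one iteration. Pass with_work = 0 to model the empty-loop
   benchmark, where only the overhead remains. */
double iteration_time(cost_t c, int with_work) {
    return c.overhead + (with_work ? c.work : 0.0);
}
```

For example, with made-up values `xeon = {1.0, 4.0}` and `phi = {2.0, 1.0}`, the Phi wins when the work is present (3.0 versus 5.0) and loses when only the overhead is left (2.0 versus 1.0), which mirrors the diagram.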

Upvotes: 2
