Superpronker
Superpronker

Reputation: 319

Why is MATLAB so slow on my Windows server?

I have Matlab R2017a installed on a server running MS Windows Server 2008 R2 Enterprise v 6.1 (SP1) and the benchmark results are awful:

bench
3.6424   0.5267   0.2114   5.0303   1.5557   3.4980

[columns = LU, FFT, ODE, Sparse, 2-D, 3-D]

Note that it is particularly slow for LU and Sparse.

The server has this hardware: CPU: Intel Xeon E7320 @ 2.13GHz (4 physical processors, 16 logical) 128 GB RAM 64-bit operating system

Matlab Version: 9.2.0.556344 (R2017a) Java version: Java 1.7.0_60-b19 with Oracle corporation Java Hotspot(TM) 64-Bit Server VM mixed mode.

There are also other users that can be online on the server but I can see that they are not stressing the system and have verified that these running times are stable (have tested multiple times the past week.

My question is this: is there any other library or something that Matlab relies on that could be "wrong"? I have another similar setup on a similar but slightly newer server that gets bench results much closer to what I'd expect based on the specs. I'm wondering if it's using a "wrong" linear algebra module or something.

Alternative explanation I know that Matlab ran extremely slowly on a particular AMD Opteron CPU (I happen to also have worked on such a server in Matlab, link https://se.mathworks.com/matlabcentral/answers/33939-poor-matlab-performance-on-amd-based-computer). Is it possible that it's a similar issue with the Intel Xeon E7320?

Edit: Xeon E7320 as suggested by Peter.

Upvotes: 2

Views: 839

Answers (1)

Peter Cordes
Peter Cordes

Reputation: 364593

Update: I'm not sure whether Matlab's bench takes advantage of just a single CPU core, multiple CPU cores, or also a GPU (OpenCL / CUDA). If it can use GPU acceleration, that would make a huge difference. (Especially if you don't have one at all in your "slow" server).

As discussed in comments, a dual-core Sandybridge laptop is 10x faster on some of the benchmarks, but only 2 or 1.5x faster on some other components. (But I'm not sure if the version of Matlab was controlled for; that thread you linked mentioned that different version of Matlab do a different amount of work in their bench).


The rest of this answer was written with the assumption that your test takes advantage of all your CPU cores (otherwise there's no point using an old many-core machine). But without considering GPU.


I think your CPU is actually a 65nm Core2-based Xeon E7320, not "E3720" (no google hits). What are you comparing against? Your Tigerton CPUs are ancient (about 10 years old), of course they're slow. (Tigerton is the same microarchitecture as Conroe/Merom, first-gen Core2).

You have very low memory bandwidth and cache speeds compared to a modern CPU, as well as only having SSSE3, not AVX or FMA. These CPUs don't have a memory controller built-in, so all 4 sockets are sharing the memory controller hub (MCH) via separate 1066MHz Front-Side Buses. Memory bandwidth doesn't scale with number of sockets, and is not great. Memory bandwidth has grown faster than per-core performance over the years. According to that link, a quad-socket 16-core Tigerton (like you have) is barely better than a quad-socket 8-core Barcelona Opteron. It's not so bad for CPU-bound workloads, but memory-bound workloads will do quite badly.


As well as the low clock speed, it's significantly slower clock-for-clock than a modern CPU. IDK what those times are supposed to be like (I'm here for the [performance] tag, not Matlab), but it's totally plausible that a 3GHz quad-core i5 or i7 Haswell / Skylake desktop or high-power laptop would be faster than your 16-core dinosaur machine.

(Actually, does that benchmark even scale with the number of cores? If not, the single-threaded memory bandwidth is probably really not good.)

A very big jump in performance happened with Sandybridge (for all code, including non-SIMD workloads), and there were several other smaller jumps in between your machine and modern CPUs as well. SnB can run 2 load instructions per clock, vs. 1 for previous Intel (like your Core2).

For FP-specific stuff that optimized libraries will take advantage of, x86 ISA extensions have been important: AVX doubles the SIMD vector width, doubling FLOPS (on Intel CPUs with full-width execution units). FMA does a mul+add in one instruction, potentially doubling FLOPS. Microarchitectural improvements are important, too: Haswell has two FMA units vs. earlier CPUs having one FP adder and one FP multiplier, again potentially doubling FLOPS. Only contiguous memory and high computation vs. memory workloads will fully take advantage of this, e.g. a dense matmul, but in that case one Haswell core is doing as much work as 8 Tigerton cores.

I assume Matlab can take advantage of AVX + FMA if the CPU has it.


And BTW, it's not just 16 "logical" processors. You don't have hyperthreading, so you have a 4-socket system with four quad-core CPUs, for 16 physical cores. (And these "quad core" chips are actually two separate dual-core dies in the same package, according to wikipedia.

So as far as the number of physical chips that need to communicate with each other, there are 8 (two in each package). That's a lot of hops to reach other CPUs, so synchronization between cores is more expensive than for a single-die quad-core. (And probably worse even than a modern dual-socket Xeon box with a pair of 18-core CPUs or something).

Note that high latency to memory can also hurt memory bandwidth: see the "latency bound platforms" part of this answer about optimizing memcpy/memset and how store bandwidth works in Intel CPUs.

Upvotes: 4

Related Questions