Reputation: 33
The system in question is a dual-CPU Xeon server running CentOS with 256 GB of RAM:
2 x Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
Each CPU has 8 cores, so with hyperthreading the system shows 32 logical processors in /proc/cpuinfo.
While using this system, I noticed some peculiar performance issues during data processing. The data processing system is built in Python 3.3.5 (environment set up with Anaconda) and spawns a number of processes that read data from a file, create some numpy arrays, and do some processing.
I tested the processing with various numbers of spawned processes. Up to a certain number of processes, performance stayed relatively constant. However, once I got to 16 processes, a numpy.abs() call started taking around 10 times longer than it should, going from around 2 seconds to 20 or more.
Total memory usage in this test was not a problem: of the 256 GB of system RAM, htop showed 100+ GB free, and meminfo showed no swapping.
I ran another test with 16 processes but loading less data, so that total memory use was around 75 GB. In this case the numpy.abs() call took 1 second, which is expected since it's half the data. Going to 24 processes, still using less than half the system RAM, the numpy.abs() call likewise took around 1 second, so the 10x performance hit was gone.
The interesting thing is that performance seems to degrade terribly whenever more than half the system memory is in use. It doesn't seem like that should be the case, but I have no other explanation.
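Since this is a two-socket board, each CPU has fast local access to only half of the 256 GB (NUMA), which would line up suspiciously well with a "half the system memory" threshold. As a rough check of how memory is laid out and used per NUMA node (a sketch that assumes a Linux sysfs layout with /sys/devices/system/node/node*/meminfo present, as on typical CentOS installs), something like this can be run while the benchmark is going:

import glob
import re

# Report MemTotal/MemFree for each NUMA node from sysfs (Linux only).
for path in sorted(glob.glob('/sys/devices/system/node/node*/meminfo')):
    node = path.split('/')[-2]
    with open(path) as f:
        text = f.read()
    total_kb = int(re.search(r'MemTotal:\s+(\d+) kB', text).group(1))
    free_kb = int(re.search(r'MemFree:\s+(\d+) kB', text).group(1))
    print(node, 'total %.1f GB, free %.1f GB' % (total_kb / 1024**2, free_kb / 1024**2))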
I wrote a Python script that roughly simulates what the processing framework does. I've tried various methods of spawning processes (multiprocessing.Pool.apply_async(), concurrent.futures, and multiprocessing.Process), and they all give the same results.
import pdb
import os
import sys
import time
import argparse
import numpy
import multiprocessing as mp

def worker(n):
    print("Running worker", n)
    NX = 20000
    NY = 10000
    time_start = time.time()
    # build a large complex array and take its magnitude
    x1r = numpy.random.rand(NX, NY)
    x1i = numpy.random.rand(NX, NY)
    x1 = x1r + 1j * x1i
    x1a = numpy.abs(x1)
    print(time.time() - time_start)

def proc_file(nproc):
    procs = {}
    for i in range(0, nproc):
        procs[i] = mp.Process(target=worker, args=(i,))
        procs[i].start()
    for i in range(0, nproc):
        procs[i].join()

if __name__ == "__main__":
    time_start = time.time()
    DEFAULT_NUM_PROCS = 8
    ap = argparse.ArgumentParser()
    ap.add_argument('-nproc', default=DEFAULT_NUM_PROCS, type=int,
                    help="Number of cores to run in parallel, default = %d"
                         % DEFAULT_NUM_PROCS)
    opts = ap.parse_args()
    nproc = opts.nproc
    # spawn processes
    proc_file(nproc)
    time_end = time.time()
    print('Done in', time_end - time_start, 's')
Some results for various numbers of processes:
$ python test_multiproc_2.py -nproc 4
Running worker 0
Running worker 1
Running worker 2
Running worker 3
12.1790452003479
12.180120944976807
12.191224336624146
12.205029010772705
Done in 12.22369933128357 s
$ python test_multiproc_2.py -nproc 8
Running worker 0
Running worker 1
Running worker 2
Running worker 3
Running worker 4
Running worker 5
Running worker 6
Running worker 7
12.685678720474243
12.692482948303223
12.704699039459229
13.247581243515015
13.253047227859497
13.261905670166016
13.29712200164795
13.458561897277832
Done in 13.478906154632568 s
$ python test_multiproc_2.py -nproc 16
Running worker 0
Running worker 1
Running worker 2
Running worker 3
Running worker 4
Running worker 5
Running worker 6
Running worker 7
Running worker 8
Running worker 9
Running worker 10
Running worker 11
Running worker 12
Running worker 13
Running worker 14
Running worker 15
135.4193136692047
145.7047221660614
145.99714827537537
146.088121175766
146.3116044998169
146.94093680381775
147.05147790908813
147.4889578819275
147.8443088531494
147.92090320587158
148.32112169265747
148.35854578018188
149.11916518211365
149.22325253486633
149.45888781547546
149.74489760398865
Done in 149.97473335266113 s
So, 4 and 8 processes take about the same time, but with 16 processes it is 10 times slower! The notable thing is that in the 16-process case, memory usage hits 146 GB.
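Back-of-the-envelope arithmetic on the per-worker footprint (my estimate, not a measurement) is consistent with that number: two 20000x10000 float64 arrays, one complex128 array, and the float64 result come to roughly 8 GB per worker at peak, so 16 workers sit around 120 GB before counting interpreter and allocator overhead.

# Rough footprint estimate per worker (assumes the array sizes from the script above)
NX, NY = 20000, 10000
per_worker = (2 * NX * NY * 8      # x1r, x1i: float64
              + NX * NY * 16       # x1: complex128
              + NX * NY * 8)       # x1a: float64 result
print(per_worker / 1024**3, 'GiB per worker')            # ~7.5 GiB
print(16 * per_worker / 1024**3, 'GiB for 16 workers')   # ~119 GiB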
If I cut the size of the numpy array in half and run it again:
$ python test_multiproc_2.py -nproc 4
Running worker 1
Running worker 0
Running worker 2
Running worker 3
5.926755666732788
5.93787956237793
5.949704885482788
5.955750226974487
Done in 5.970340967178345 s
$ python test_multiproc_2.py -nproc 16
Running worker 1
Running worker 3
Running worker 0
Running worker 2
Running worker 5
Running worker 4
Running worker 7
Running worker 8
Running worker 6
Running worker 11
Running worker 9
Running worker 10
Running worker 13
Running worker 12
Running worker 14
Running worker 15
7.728739023208618
7.751606225967407
7.754587173461914
7.760802984237671
7.780809164047241
7.802706241607666
7.852390766143799
7.8615334033966064
7.876686096191406
7.891174793243408
7.916942834854126
7.9261558055877686
7.947092771530151
7.967057704925537
8.012752294540405
8.119316577911377
Done in 8.135530233383179 s
So there is a small performance hit going from 4 to 16 processes, but nothing close to what I see with the larger array.
Also, if I double the array size and run it again:
$ python test_multiproc_2.py -nproc 4
Running worker 1
Running worker 0
Running worker 2
Running worker 3
23.567795515060425
23.747386693954468
23.76904606819153
23.781703233718872
Done in 23.83848261833191 s
$ python test_multiproc_2.py -nproc 8
Running worker 1
Running worker 0
Running worker 3
Running worker 2
Running worker 5
Running worker 4
Running worker 6
Running worker 7
103.20905923843384
103.52968168258667
103.62282609939575
103.62272334098816
103.77079129219055
103.77456998825073
103.86126565933228
103.87058663368225
Done in 104.26257705688477 s
With 8 processes now, RAM use hits 145 GB and there is a 5x performance hit.
I don't know what to make of this. The system becomes basically unusable whenever more than half of the system memory is in use, but I don't know if that's just coincidence or whether something else is to blame.
Is this a Python thing? Or a system architecture thing? Does each physical CPU only play well with half the system memory? Or is it a memory bandwidth issue? What else can I do to try to figure this out?
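One experiment that might help separate a NUMA/bandwidth effect from a Python effect (a sketch, assuming Linux and Python 3.3+, where os.sched_setaffinity is available) is to pin all workers to the cores of one physical CPU and see whether the timings still blow up once memory use crosses the halfway mark. The core numbers below are hypothetical; the actual core-to-socket mapping should be checked with lscpu first.

import os

def pinned_worker(n):
    # Pin this process to (hypothetically) socket 0's cores before doing the work;
    # reuses worker() from the script above.
    os.sched_setaffinity(0, {0, 1, 2, 3, 4, 5, 6, 7})
    worker(n)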
Upvotes: 2
Views: 304
Reputation: 33
The only thing that resolved the problem was clearing the cached memory. I ran a test that needed just about all 256 GB of memory while the OS was using about 200 GB for cache. It took forever and started falling apart once the OS began freeing cache. After this test ran, 'free -m' showed only 3 GB of cached memory. I then ran the same benchmark and it finished in the expected amount of time, without the CPU craziness seen before. Performance stayed constant over repeated runs.
So, contrary to what I've read online about the OS memory cache not affecting application performance, my experience very much tells me that it does, at least in this particular use case.
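A minimal sketch of one common way to drop the page cache on Linux (an assumption about what "clearing the cached memory" involved here; it needs root, and the value 3 means page cache plus dentries and inodes):

import os

# Flush dirty pages, then ask the kernel to drop clean page cache, dentries, and inodes.
os.sync()
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')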
Upvotes: 1
Reputation: 7280
This is a problem with languages that use garbage collection: if you're too close to maximum RAM, they start trying to run the GC all the time, resulting in an increase in CPU usage.
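A quick way to rule this in or out for the benchmark in the question (a sketch; note it only disables CPython's cyclic collector, which is triggered by object allocation counts rather than by how full RAM is) is to switch the collector off inside the worker and compare timings:

import gc

def worker_without_gc(n):
    gc.disable()   # turn off CPython's cyclic garbage collector in this process
    worker(n)      # reuses worker() from the question's script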
Upvotes: 0