Colera Su

Reputation: 193

Performance drop in NumPy matrix-vector multiplication

I've encountered a (mysterious?) performance issue with NumPy matrix-vector multiplication.

I wrote the following snippet to test the speed of matrix-vector multiplication:

import timeit

for i in range(90, 101):
    # timeit.repeat returns one total per repetition (5 by default on Python 3.7)
    tm = timeit.repeat('np.matmul(a, b)', number=10000,
        setup='import numpy as np; a, b = np.random.rand({0},{0}), np.random.rand({0})'.format(i))
    print(i, sum(tm) / len(tm))

On some machines, the result is normal:

90 0.08936462279998522
91 0.08872119059979014
92 0.09083068459967762
93 0.09311594780047017
94 0.09907015420012613
95 0.10136517100036144
96 0.10339414420013782
97 0.10627872140012187
98 0.1102267580001353
99 0.11277738099979615
100 0.11471197419996315

On some machines, the multiplication slows down at size 96:

90 0.03618830284103751
91 0.03737151022069156
92 0.03295294055715203
93 0.02851409767754376
94 0.02677299762144685
95 0.028137388220056892
96 0.1916038074065
97 0.16719966367818415
98 0.18511182265356182
99 0.1806833743583411
100 0.17172936061397195

On some, it even slows down by a factor of about 1000:

90 0.04183819475583732
91 0.029678784403949977
92 0.02486871089786291
93 0.02882006801664829
94 0.028613184532150625
95 0.02956576123833656
96 31.16711748293601
97 27.803299666382372
98 31.368976181373
99 27.71114011341706
100 26.219610543036833

The Python / NumPy versions are the same on all the machines I tested (3.7.2 / 1.16.2). The OS is also the same (Arch Linux).

What could be the reason for this? And why does it occur at size 96?

Upvotes: 6

Views: 349

Answers (2)

Alex Lopatin

Reputation: 692

I think I have finally found the correct answer and an explanation:

  1. This problem is fixed in Python 3.8.0a2 (the current pre-release testing version).
  2. The problem exists in Python 3.7.2 (the latest release) on Windows and macOS.

I wrote a slightly longer program to test both my Windows and macOS computers. It looks like NumPy under Python 3.7 starts to run the matmul function on all four logical processors of my computers from size 96 on. I don't see this with 3.8.0a2:

$ python3.8 numpy_matmul.py       $ python3.7 numpy_matmul.py     

Python version  : 3.8.0a2         Python version  : 3.7.2         
  build:('v3.8.0a2:23f4589b4b',    build:('v3.7.2:9a3ffc0492',
        Feb 25 2019 10:59:08')          'Dec 24 2018 02:44:43')
  compiler:                        compiler:
     Clang 6.0 (clang-600.0.57)   Clang 6.0 (clang-600.0.57) 

Tested by Python code only :      Tested by Python code only :  
 90 time = 0.1132 cpu = 0.1100     90 time = 0.1535 cpu = 0.1236
 91 time = 0.1133 cpu = 0.1130     91 time = 0.1264 cpu = 0.1263
 92 time = 0.1079 cpu = 0.1077     92 time = 0.1089 cpu = 0.1087
 93 time = 0.1146 cpu = 0.1145     93 time = 0.1226 cpu = 0.1224
 94 time = 0.1176 cpu = 0.1174     94 time = 0.1273 cpu = 0.1271
 95 time = 0.1216 cpu = 0.1215     95 time = 0.1372 cpu = 0.1371
 96 time = 0.1115 cpu = 0.1114     96 time = 0.2854 cpu = 0.8933
 97 time = 0.1231 cpu = 0.1229     97 time = 0.2887 cpu = 0.9033
 98 time = 0.1174 cpu = 0.1173     98 time = 0.2836 cpu = 0.8963
 99 time = 0.1330 cpu = 0.1301     99 time = 0.3100 cpu = 0.9108
100 time = 0.1130 cpu = 0.1128    100 time = 0.3149 cpu = 0.9087

Tested with timeit.repeat :       Tested with timeit.repeat :   
 90 time = 0.1060 cpu = 0.1066     90 time = 0.1238 cpu = 0.3264
 91 time = 0.1091 cpu = 0.1097     91 time = 0.1233 cpu = 0.1240
 92 time = 0.1021 cpu = 0.1027     92 time = 0.1138 cpu = 0.1128
 93 time = 0.1149 cpu = 0.1156     93 time = 0.1324 cpu = 0.1327
 94 time = 0.1135 cpu = 0.1139     94 time = 0.1319 cpu = 0.1326
 95 time = 0.1170 cpu = 0.1177     95 time = 0.1325 cpu = 0.1331
 96 time = 0.1069 cpu = 0.1076     96 time = 0.2879 cpu = 0.8886
 97 time = 0.1192 cpu = 0.1198     97 time = 0.2867 cpu = 0.8986
 98 time = 0.1151 cpu = 0.1155     98 time = 0.3034 cpu = 0.8854
 99 time = 0.1200 cpu = 0.1207     99 time = 0.2867 cpu = 0.8966
100 time = 0.1146 cpu = 0.1153    100 time = 0.2901 cpu = 0.9018
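
In the 3.7.2 column, cpu is roughly three times time from size 96 on, while below 96 the two are nearly equal. That ratio is what suggests several cores are busy. A minimal sketch of the check, using only the standard library; `cpu_wall_ratio` and `busy` are names made up for this illustration, and `busy` is just a single-threaded stand-in workload:

```python
import time


def cpu_wall_ratio(func, repeats=1):
    """Call func() `repeats` times; return (wall seconds, CPU seconds, CPU/wall)."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    for _ in range(repeats):
        func()
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    ratio = (cpu / wall) if wall > 0 else float('nan')
    return wall, cpu, ratio


def busy():
    # Hypothetical stand-in workload: pure Python, runs on a single thread.
    return sum(k * k for k in range(200000))


wall, cpu, ratio = cpu_wall_ratio(busy, repeats=5)
print('wall = %.4f cpu = %.4f ratio = %.2f' % (wall, cpu, ratio))
```

A ratio close to 1 means the work ran on one core. Passing `lambda: np.matmul(a, b)` instead of `busy` and seeing the ratio climb well above 1 at size 96 would match the 3.7.2 numbers above.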

Here is numpy_matmul.py:

import time
import timeit
import numpy as np
import platform


def correct_cpu(cpu_time):
    pv1, pv2, _ = platform.python_version_tuple()
    pcv = platform.python_compiler()
    # Compare the minor version as an integer; comparing strings breaks for '10' and up
    if pv1 == '3' and 5 <= int(pv2) <= 8 and pcv == 'Clang 6.0 (clang-600.0.57)':
        cpu_time /= 2.0
    return cpu_time


def test(func, n, name):
    print('\nTested %s :' % name)
    for i in range(90, 101):
        t = time.perf_counter()
        c = time.process_time()
        tm = func(i, n)
        t = time.perf_counter() - t
        c = correct_cpu(time.process_time() - c)
        st = t if tm <= 0.0 else tm
        print('%3d time = %.4f cpu = %.4f' % (i, st, c))
        if abs(t-st)/st > 0.02:
            print('    time!= %.4f' % t)


def test1(i, n):
    a, b = np.random.rand(i, i), np.random.rand(i)
    for _ in range(n):
        np.matmul(a, b)
    return 0.0


def test2(i, n):
    s = 'import numpy as np;' + \
        'a, b = np.random.rand({0},{0}), np.random.rand({0})'
    s = s.format(i)
    r = 'np.matmul(a, b)'
    t = timeit.repeat(stmt=r, setup=s, number=n)
    return sum(t)


def test3(i, n):
    s = 'import numpy as np;' + \
        'a, b = np.random.rand({0},{0}), np.random.rand({0})'
    s = s.format(i)
    r = 'np.matmul(a, b)'
    return timeit.timeit(stmt=r, setup=s, number=n)


print('Python version  :', platform.python_version())
print('       build    :', platform.python_build())
print('       compiler :', platform.python_compiler())
num = 10000
test(test1, 5 * num, 'by Python code only')
test(test2, num, 'with timeit.repeat')
test(test3, 5 * num, 'with timeit.timeit')
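
If the extra CPU time really comes from the BLAS library behind NumPy switching to several threads, limiting it to one thread should make the jump at 96 disappear. Here is a sketch of that experiment. It assumes the backend is OpenBLAS, MKL, or another OpenMP-based library that honors these environment variables, which I have not verified for the builds above. `time_matvec` is a made-up helper name.

```python
import os

# Assumption: the BLAS backend reads these variables when it is loaded, so
# they must be set before NumPy is imported for the first time in the process.
for var in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS'):
    os.environ[var] = '1'

import timeit


def time_matvec(size, number=1000):
    """Best total time over the repeats for `number` matrix-vector products."""
    setup = ('import numpy as np; '
             'a, b = np.random.rand({0},{0}), np.random.rand({0})'.format(size))
    return min(timeit.repeat('np.matmul(a, b)', setup=setup, number=number))


try:
    for size in range(90, 101):
        print(size, time_matvec(size))
except ImportError:
    # Raised from the timeit setup string when NumPy is not available.
    print('NumPy is not installed; nothing was timed.')
```

If the times grow smoothly across 95, 96, 97 with one thread but jump without the variables set, threading in the BLAS layer is the likely cause.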

Upvotes: 2

Alex Lopatin

Reputation: 692

At size 96 your test seems to hit some software/hardware limit: 96*96*96 = 884,736, which is close to 1M; multiplied by 8 bytes per float, that is 7,077,888 bytes. The Intel i5 processor has 6 MB of L3 cache. My iMac has this type of processor and shows the slowdown at size 96. The Intel® Core™ i5-7200U processor has 3 MB of L3 cache and doesn't have this problem. So it could be that the software algorithm does not work correctly with a 6 MB cache.
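
The numbers can be checked in a few lines. This is only a sketch: `operand_bytes` is a name made up for the illustration, and it assumes float64 values (8 bytes each), which is what np.random.rand produces. It prints the 96*96*96 figure next to the memory that a 96x96 matrix and a length-96 vector occupy.

```python
FLOAT64_BYTES = 8  # np.random.rand returns float64 values


def operand_bytes(n):
    """Bytes held by an n x n float64 matrix and an n-element float64 vector."""
    return n * n * FLOAT64_BYTES, n * FLOAT64_BYTES


n = 96
matrix_bytes, vector_bytes = operand_bytes(n)
cubed_bytes = n ** 3 * FLOAT64_BYTES   # the 96*96*96 figure, in bytes
l3_bytes = 6 * 1024 * 1024             # a 6 MB L3 cache, in bytes

print('matrix :', matrix_bytes)
print('vector :', vector_bytes)
print('n^3 * 8:', cubed_bytes)
print('6 MB L3:', l3_bytes)
```

The operands of a single matrix-vector product at size 96 come to 73,728 + 768 bytes, which is far below both L3 sizes mentioned, so the 96*96*96 figure does not describe what this benchmark keeps in memory.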

Upvotes: 1
