Bbrown44
Bbrown44

Reputation: 49

convert formula into DML in SystemML (Java)

How would you convert this formula.

formula = sum(U' (W . (U S)))

Legend  
'   transpose of a matrix
*   matrix-matrix multiplication
.   scalar multiplication

from python

u = np.random.rand(1000,10000)
s = np.random.rand(10000,1000)
w = np.random.rand(1000,1000)

start = time.time()
res = np.sum(u.T.dot(w * u.dot(s)))
print time.time()-start

using DML in SystemML with the following data

u = np.random.rand(10000,100000)
s = np.random.rand(100000,10000)
w = np.random.rand(10000,10000)

Syntax

t(M)    transpose of a matrix, where M is the matrix
%*%     matrix-matrix multiplication
*       scalar multiplication

Task

Upvotes: 0

Views: 187

Answers (1)

mboehm7
mboehm7

Reputation: 115

Well, you can almost directly write your original expression as shown below. However, note that without consumer this would be subject to dead code elimination - hence the print:

U = rand(rows=1000, cols=10000);
S = rand(rows=10000, cols=1000);
W = rand(rows=1000, cols=1000);

print(sum(t(U) %*% (W * (U %*% S))))

This will be compiled to the following runtime plan:

PROGRAM
--MAIN PROGRAM
----GENERIC (lines 1-9) [recompile=false]
------(8) dg(rand) [1000,10000,1000,1000,10000000] [0,0,76 -> 76MB], CP
------(33) ua(+R) (8) [1000,1,1000,1000,-1] [76,0,0 -> 76MB], CP
------(40) r(t) (33) [1,1000,1000,1000,-1] [0,0,0 -> 0MB], CP
------(26) dg(rand) [1000,1000,1000,1000,1000000] [0,0,8 -> 8MB], CP
------(17) dg(rand) [10000,1000,1000,1000,10000000] [0,0,76 -> 76MB], CP
------(28) ba(+*) (8,17) [1000,1000,1000,1000,-1] [153,5,8 -> 165MB], CP
------(29) b(*) (26,28) [1000,1000,1000,1000,-1] [15,0,8 -> 23MB], CP
------(35) ua(+R) (29) [1000,1,1000,1000,-1] [8,0,0 -> 8MB], CP
------(38) ba(+*) (40,35) [1,1,1000,1000,-1] [0,0,0 -> 0MB], CP
------(39) u(cast_as_scalar) (38) [0,0,0,0,-1] [0,0,0 -> 0MB]
------(32) u(print) (39) [-1,-1,-1,-1,-1] [0,0,0 -> 0MB]

which would correspond to the following script level expression (that pushes the final sum through the last matrix multiply and transpose):

print(as.scalar(t(rowSums(U)) %*% rowSums(W * (U %*% S))))

Furthermore, different data characteristics (dimensions and sparsity) can result in vastly different execution plans. For example, if we have a sparse W and an outer-product like matrix multiply

U = rand(rows=10000, cols=100);
S = rand(rows=100, cols=10000);
W = rand(rows=10000, cols=10000, sparsity=0.001);

we get the following execution plan

PROGRAM
--MAIN PROGRAM
----GENERIC (lines 5-9) [recompile=false]
------(8) dg(rand) [10000,100,1000,1000,1000000] [0,0,8 -> 8MB], CP
------(33) ua(+R) (8) [10000,1,1000,1000,-1] [8,0,0 -> 8MB], CP
------(43) r(t) (33) [1,10000,1000,1000,-1] [0,0,0 -> 0MB], CP
------(26) dg(rand) [10000,10000,1000,1000,100000] [0,0,2 -> 2MB], CP
------(17) dg(rand) [100,10000,1000,1000,1000000] [0,0,8 -> 8MB], CP
------(37) r(t) (17) [10000,100,1000,1000,1000000] [8,0,8 -> 15MB], CP
------(39) q(wdivmm) (26,8,37) [10000,10000,1000,1000,100000] [17,0,2 -> 19MB], CP
------(35) ua(+R) (39) [10000,1,1000,1000,-1] [2,0,0 -> 2MB], CP
------(41) ba(+*) (43,35) [1,1,1000,1000,-1] [0,0,0 -> 0MB], CP
------(42) u(cast_as_scalar) (41) [0,0,0,0,-1] [0,0,0 -> 0MB]
------(32) u(print) (42) [-1,-1,-1,-1,-1] [0,0,0 -> 0MB]

where wdivmm (a built-in, sparsity-exploiting, fused operator) replaces W * (U %*% S). This fused operator only computes necessary dot products Ui %*% Vj for non-zero values Wij. With enabled code generation for operator fusion (has to be explicitly enabled via sysml.codegen.enabled=true), the subsequent rowSums would be also fused into this chain.

By extension, this automatic compilation also applies to distributed operations. For example, assume you have a small driver of 10GB max heap with 16 virtual cores, and the following scenario:

U = rand(rows=1000000, cols=100);
S = rand(rows=100, cols=1000000);
W = rand(rows=1000000, cols=1000000, sparsity=0.001);

you would end up with the following plan, where operators marked with SPARK would run as distributed operations, while operators marked with CP (i.e., control program) would run locally in the driver:

# Memory Budget local/remote = 6372MB/183420MB/220104MB/12839MB
# Degree of Parallelism (vcores) local/remote = 16/144
PROGRAM
--MAIN PROGRAM
----GENERIC (lines 5-9) [recompile=true]
------(8) dg(rand) [1000000,100,1000,1000,100000000] [0,0,763 -> 763MB], CP
------(33) ua(+R) (8) [1000000,1,1000,1000,-1] [763,15,8 -> 786MB], CP
------(43) r(t) (33) [1,1000000,1000,1000,-1] [8,0,8 -> 15MB], CP
------(26) dg(rand) [1000000,1000000,1000,1000,1000000000] [0,8,11524 -> 11532MB], SPARK
------(17) dg(rand) [100,1000000,1000,1000,100000000] [0,0,763 -> 763MB], CP
------(37) r(t) (17) [1000000,100,1000,1000,100000000] [763,0,763 -> 1526MB], CP
------(39) q(wdivmm) (26,8,37) [1000000,1000000,1000,1000,1000000000] [13050,0,11524 -> 24574MB], SPARK
------(35) ua(+R) (39) [1000000,1,1000,1000,-1] [11524,15,8 -> 11547MB], SPARK
------(41) ba(+*) (43,35) [1,1,1000,1000,-1] [15,0,0 -> 15MB], SPARK
------(42) u(cast_as_scalar) (41) [0,0,0,0,-1] [0,0,0 -> 0MB], CP
------(32) u(print) (42) [-1,-1,-1,-1,-1] [0,0,0 -> 0MB]

Upvotes: 1

Related Questions