Reputation: 31
For GPU training of the model, I am using
dudt = Chain(Dense(3,100,tanh),
Dense(100,3)) |> gpu
versus
CPU training
dudt = FastChain(
FastDense(3,100,tanh),
FastDense(100,3))
Over 1000 iterations, FastChain is orders of magnitude faster than running on the GPU (a Tesla K40c). Is this expected behaviour? Or could I be doing something wrong in implementing the model on the GPU? An MWE for the GPU implementation follows:
using OrdinaryDiffEq, DiffEqFlux, Flux, CuArrays

function lorenz(du,u,p,t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du[1] = σ*(u[2]-u[1])
    du[2] = u[1]*(ρ-u[3]) - u[2]
    du[3] = u[1]*u[2] - β*u[3]
    return
end
u0 = Float32[1.0,0.0,0.0]
tspan = (0.0,1.0)
para = [10.0,28.0,8/3]
prob = ODEProblem(lorenz, u0, tspan, para)
t = range(tspan[1],tspan[2],length=101)
ode_data = Array(solve(prob,Tsit5(),saveat=t))
ode_data = cu(ode_data)
u0train = [1.0,0.0,0.0] |> gpu
tspantrain = (0.0,1.0)
ttrain = range(tspantrain[1],tspantrain[2],length=101)
dudt = Chain(Dense(3,100,tanh),
Dense(100,3)) |> gpu
n_ode = NeuralODE(dudt, tspantrain, Tsit5(), saveat=ttrain)
function predict_n_ode(p)
    n_ode(u0train, p)
end
function loss_n_ode(p)
    pred = predict_n_ode(p) |> gpu
    loss = sum(abs2, pred .- ode_data)
    loss, pred
end
cb = function (p, l, pred...)   # monitor training by printing the loss
    println(l)
    return false
end
res1 = DiffEqFlux.sciml_train(loss_n_ode, n_ode.p, ADAM(0.01), cb=cb, maxiters = 1000)
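For reference, the CPU/FastChain run I am comparing against looks roughly like this (a sketch of my setup rather than the exact script; it reuses the data and hyperparameters from the MWE above and assumes a DiffEqFlux version that provides FastChain/FastDense):
dudt_cpu = FastChain(FastDense(3,100,tanh),
                     FastDense(100,3))
n_ode_cpu = NeuralODE(dudt_cpu, tspantrain, Tsit5(), saveat=ttrain)
ode_data_cpu = Array(ode_data)   # pull the target data back to the CPU
function loss_cpu(p)
    pred = Array(n_ode_cpu(Float32[1.0,0.0,0.0], p))
    loss = sum(abs2, pred .- ode_data_cpu)
    loss, pred
end
res_cpu = DiffEqFlux.sciml_train(loss_cpu, n_ode_cpu.p, ADAM(0.01), cb=cb, maxiters = 1000)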
Upvotes: 1
Views: 258
Reputation: 19132
That model is too small for GPU parallelism to really make a difference. The neural network is essentially 3 matvecs: 100x3, 100x100, 3x100. The only one with a kernel that probably comes close to breaking even is the middle one, where a 100x100 matrix is multiplied by a length-100 vector.
For example, on my machine:
using BenchmarkTools, CuArrays
A = rand(100,100); x = rand(100);
@btime A*x; # 56.299 μs (1 allocation: 896 bytes)
gA = cu(A); gx = cu(x)
@btime gA*gx; # 12.499 μs (6 allocations: 160 bytes)
A = rand(100,3); x = rand(3);
@btime A*x; # 251.695 ns (1 allocation: 896 bytes)
gA = cu(A); gx = cu(x)
@btime gA*gx; # 12.212 μs (6 allocations: 160 bytes)
So while the speedup on the largest operation does exist, it's not enough to overcome the slowdown from putting the other small operations on the GPU. This is because GPUs have a high floor per operation (on my machine around 12 μs), so you have to make sure your problem is large enough for the GPU to really make sense. Machine learning generally benefits from GPUs because it is dominated by large matrix multiplications, with layer sizes in the tens of thousands.
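You can see the crossover by rerunning the same benchmark with a much larger layer; the size below is just an illustrative choice and the exact timings will depend on your hardware:
using BenchmarkTools, CuArrays
# Same matvec benchmark as above, but with a layer size in the thousands,
# so the arithmetic per call dwarfs the ~12 μs GPU overhead.
A = rand(5000,5000); x = rand(5000);
@btime A*x;
gA = cu(A); gx = cu(x)
@btime gA*gx;
At that size the fixed GPU overhead is a small fraction of the total cost of the call, which is the regime where moving the model to the GPU pays off.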
Upvotes: 2