Reputation: 1573
The code below shows that a plain Python loop is faster than pandas for this operation, which is the opposite of what I expected before testing. Am I using pandas incorrectly here? In my measurements the pandas solution is about 7 times slower:
Pandas time 0.0008931159973144531
Loop time 0.0001239776611328125
Code:
import pandas as pd
import numpy as np
import time
import torch
batch_size = 5
classes = 4
raw_target = torch.from_numpy(np.array([1, 0, 3, 2, 0]))
rows = np.array(range(batch_size))
t0 = time.time()
zeros = pd.DataFrame(0, index=range(batch_size), columns=range(classes))
zeros.iloc[[rows, raw_target.numpy()]] = 1
t1 = time.time()
print("Pandas time ", t1-t0)
t0 = time.time()
target = raw_target.numpy()
zeros = np.zeros((batch_size, classes), dtype=np.float64)
for zero, target in zip(zeros, target):
zero[target] = 1
t1 = time.time()
print("Loop time ", t1-t0)
The code uses PyTorch because the actual code where the problem exists uses PyTorch.
What would be a better or optimal solution for this example? The resulting matrix is:
[[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]]
Upvotes: 0
Views: 210
Reputation: 15119
Depending on your use-case, having everything running through PyTorch could be advantageous (e.g. to keep all computations on the GPU).
The PyTorch-only solution would follow the numpy syntax (i.e. zeros[rows, raw_target] = 1.):
import numpy as np
import torch
batch_size = 5
classes = 4
raw_target = torch.from_numpy(np.array([1, 0, 3, 2, 0]))
rows = torch.arange(batch_size, dtype=torch.int64)  # torch.range is deprecated; arange excludes the end value
x = torch.zeros((batch_size, classes), dtype=torch.float64)
x[rows, raw_target] = 1.
print(x.detach())
# tensor([[ 0., 1., 0., 0.],
# [ 1., 0., 0., 0.],
# [ 0., 0., 0., 1.],
# [ 0., 0., 1., 0.],
# [ 1., 0., 0., 0.]], dtype=torch.float64)
Upvotes: 2
Reputation: 11232
You should indeed expect pandas code that works on large data to be faster than iterating over it and zipping with Python. One of the reasons is that pandas/NumPy can work on the underlying contiguous data, whereas the for loop has the overhead of creating all the Python objects. You are not seeing that in your profiling because your example data is too small, so the measurements are dominated by setup code.
When profiling run time, you need to take care that you are measuring exactly what you are interested in, and that your measurements are repeatable (not drowned in noise).
Here you have very little data (only 5x4), whereas your actual data is probably much larger.
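As a rough sketch of a more repeatable measurement, the snippet below uses the standard-library timeit module (the plain-Python counterpart of IPython's %timeit) on a larger batch. The batch size of 10,000 and the helper names loop_one_hot and indexed_one_hot are assumptions made for illustration, and the timings will differ from machine to machine:

```python
import timeit

import numpy as np

# Hypothetical sizes, much larger than the 5x4 example in the question.
batch_size = 10_000
classes = 4

rng = np.random.default_rng(0)
target = rng.integers(0, classes, size=batch_size)
rows = np.arange(batch_size)


def loop_one_hot():
    # Same approach as the Python loop in the question.
    zeros = np.zeros((batch_size, classes), dtype=np.float64)
    for row, t in zip(zeros, target):
        row[t] = 1
    return zeros


def indexed_one_hot():
    # NumPy integer-array indexing: one (row, column) pair per sample.
    zeros = np.zeros((batch_size, classes), dtype=np.float64)
    zeros[rows, target] = 1
    return zeros


# Run each function 10 times per measurement, repeat the measurement
# 5 times, and keep the best run to reduce the influence of noise.
loop_time = min(timeit.repeat(loop_one_hot, number=10, repeat=5)) / 10
indexed_time = min(timeit.repeat(indexed_one_hot, number=10, repeat=5)) / 10

print("Loop time   ", loop_time)
print("Indexed time", indexed_time)
```

Only the work inside the two functions is timed, so module imports and other one-off setup costs stay out of the comparison.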
A couple of tips:
Use %timeit to get statistical information rather than noisy measurements.
As for the practical solution to your problem, pandas uses numpy to represent the data anyhow. You can skip pandas and go directly to numpy:
zeros[rows, target] = 1
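Put together as a complete script, a minimal sketch of that numpy-only version could look like the following, reusing the values from the question. The comparison against np.eye at the end is an extra sanity check added here for illustration, not part of the original suggestion:

```python
import numpy as np

batch_size = 5
classes = 4
target = np.array([1, 0, 3, 2, 0])
rows = np.arange(batch_size)

zeros = np.zeros((batch_size, classes), dtype=np.float64)
# Integer-array indexing pairs rows[i] with target[i], so exactly one
# element per row is set to 1.
zeros[rows, target] = 1
print(zeros)

# Sanity check (added for illustration): indexing an identity matrix
# by the targets yields the same one-hot rows.
assert np.array_equal(zeros, np.eye(classes)[target])
```

If the targets start out as a PyTorch tensor, as in the question, raw_target.numpy() can be passed in place of the target array.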
Upvotes: 1