Grand Koala

Reputation: 57

Classification with PyTorch is much slower than with TensorFlow: 42 min vs. 11 min

I have been a TensorFlow user and have just started using PyTorch. As a trial, I implemented a simple classification task with both libraries.
However, PyTorch is much slower than TensorFlow: PyTorch takes 42 minutes while TensorFlow takes only 11 minutes. I based my code on the official PyTorch tutorial and made only minor changes to it.

Could anyone share some advice for this problem?

Here is a summary of what I tried.

environment: Colab Pro+
dataset: CIFAR-10
classifier: VGG16
optimizer: Adam
loss: cross-entropy
batch size: 32

PyTorch
Code:

import torch, torchvision
from torch import nn
from torchvision import transforms, models
from tqdm import tqdm
import time, copy

trans = transforms.Compose([transforms.Resize((224, 224)),
                            transforms.ToTensor(),])

data = {phase: torchvision.datasets.CIFAR10('./', train = (phase=='train'),  transform=trans, download=True) for phase in ['train', 'test']}
dataloaders = {phase: torch.utils.data.DataLoader(data[phase], batch_size=32, shuffle=True) for phase in ['train', 'test']}

def train_model(model, criterion, optimizer, dataloaders, device, num_epochs=5):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in tqdm(iter(dataloaders[phase])):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            # NOTE: len(dataloaders[phase]) is the number of batches, not samples,
            # so these values are inflated by roughly the batch size (hence the
            # Acc values above 1 in the logs below); dividing by
            # len(dataloaders[phase].dataset) would give per-sample metrics.
            epoch_loss = running_loss / len(dataloaders[phase])
            epoch_acc = running_corrects.double() / len(dataloaders[phase])

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = models.vgg16(pretrained=False)
model = model.to(device)

model = train_model(model=model,
                    criterion=nn.CrossEntropyLoss(), 
                    optimizer=torch.optim.Adam(model.parameters(), lr=0.001),
                    dataloaders=dataloaders,
                    device=device,
                    )

Result:

Epoch 0/4
----------
  0%|          | 0/1563 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
100%|██████████| 1563/1563 [07:50<00:00,  3.32it/s]
train Loss: 75.5199 Acc: 3.2809
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.7274 Acc: 3.1949

Epoch 1/4
----------
100%|██████████| 1563/1563 [07:50<00:00,  3.33it/s]
train Loss: 73.8162 Acc: 3.2514
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.6114 Acc: 3.1949

Epoch 2/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7741 Acc: 3.1369
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.5873 Acc: 3.1949

Epoch 3/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7493 Acc: 3.1331
100%|██████████| 313/313 [00:38<00:00,  8.12it/s]
test Loss: 73.6191 Acc: 3.1949

Epoch 4/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7289 Acc: 3.1939
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.5955 Acc: 3.1949

Training complete in 42m 22s
Best val Acc: 3.194888

TensorFlow
Code:

import tensorflow_datasets as tfds
from tensorflow.keras import applications, models
import tensorflow as tf
import time

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])

def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))
    image = tf.expand_dims(image,0)
    label = tf.one_hot(label,10)
    label = tf.expand_dims(label,0)
    return (image, label)

ds_train_ = ds_train.map(resize)
ds_test_ = ds_test.map(resize)


model = applications.vgg16.VGG16(input_shape = (224, 224, 3), weights=None, classes=10)
model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics= ['accuracy'])

batch_size = 32
since = time.time()
history = model.fit(ds_train_,
                    batch_size = batch_size,
                    steps_per_epoch = len(ds_train)//batch_size,
                    epochs = 5,
                    validation_steps = len(ds_test),
                    validation_data = ds_test_,
                    shuffle = True,)
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format( time_elapsed // 60, time_elapsed % 60 ))

Result:

Epoch 1/5
1562/1562 [==============================] - 125s 69ms/step - loss: 36.9022 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 2/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3031 - accuracy: 0.1005 - val_loss: 2.3033 - val_accuracy: 0.1000
Epoch 3/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3035 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 4/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3038 - accuracy: 0.1024 - val_loss: 2.3030 - val_accuracy: 0.1000
Epoch 5/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3028 - accuracy: 0.1024 - val_loss: 2.3033 - val_accuracy: 0.1000
Training complete in 11m 23s

Upvotes: 2

Views: 915

Answers (1)

Laplace Ricky

Reputation: 1687

This is because in your TensorFlow code the data pipeline feeds a batch of 1 image into the model per step, instead of a batch of 32 images.

Passing batch_size to model.fit does not control the batch size when the input is a tf.data.Dataset. The log only showed a seemingly correct number of steps per epoch because you passed steps_per_epoch to model.fit.
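
You can verify this by inspecting the element spec of the question's pipeline. A minimal sketch, reusing the resize function from the question: tf.expand_dims adds a leading dimension of size 1, so every dataset element already looks like a batch of one image, and model.fit consumes it one image per step.

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load('cifar10', split='train')

def resize(ip):
    image = tf.image.resize(ip['image'], (224, 224))
    image = tf.expand_dims(image, 0)   # adds a leading batch dimension of size 1
    label = tf.expand_dims(tf.one_hot(ip['label'], 10), 0)
    return (image, label)

# The leading dimension of 1 is treated as the batch axis:
print(ds_train.map(resize).element_spec)
# (TensorSpec(shape=(1, 224, 224, 3), dtype=tf.float32, name=None),
#  TensorSpec(shape=(1, 10), dtype=tf.float32, name=None))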

To correctly set the batch size:

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])

def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))
    label = tf.one_hot(label,10)
    return (image, label)

train_size=len(ds_train)
test_size=len(ds_test)
ds_train_ = ds_train.shuffle(train_size).batch(32).map(resize)
ds_test_ = ds_test.shuffle(test_size).batch(32).map(resize)
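
The same check (not in the original answer) confirms the pipeline is now batched; the leading None dimension is the batch axis:

print(ds_train_.element_spec)
# (TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
#  TensorSpec(shape=(None, 10), dtype=tf.float32, name=None))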

model.fit call:

history = model.fit(ds_train_,
                    epochs = 1,
                    validation_data = ds_test_)
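
As an optional further tweak (not part of the fix itself), the batched dataset can be parallelized and prefetched with standard tf.data options; a minimal sketch, assuming TF 2.4+ for tf.data.AUTOTUNE:

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older versions

ds_train_ = (ds_train.shuffle(train_size)
                     .batch(32)
                     .map(resize, num_parallel_calls=AUTOTUNE)  # parallel preprocessing
                     .prefetch(AUTOTUNE))                       # overlap input with training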

After fixing the problem, TensorFlow achieved speed comparable to PyTorch's. On my machine, PyTorch took ~27 minutes per epoch while TensorFlow took ~24 minutes per epoch.
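
Much of the remaining PyTorch time is likely input-pipeline overhead from the single-process DataLoader in the question. A hedged sketch of the usual knobs (the worker count is an assumption; tune it for your machine):

dataloaders = {
    phase: torch.utils.data.DataLoader(
        data[phase],
        batch_size=32,
        shuffle=(phase == 'train'),
        num_workers=2,    # load and preprocess batches in background processes
        pin_memory=True,  # speeds up host-to-GPU copies when training on CUDA
    )
    for phase in ['train', 'test']
}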

According to benchmarks from NVIDIA, PyTorch and TensorFlow show similar speed in most popular deep learning applications on real-world datasets and problem sizes. (Reference: https://developer.nvidia.com/deep-learning-performance-training-inference)

Upvotes: 2
