Grand Koala

Reputation: 57

Classification with PyTorch is much slower than with TensorFlow: 42 min vs. 11 min

I have been a TensorFlow user and have just started using PyTorch. As a trial, I implemented a simple classification task with both libraries.
However, PyTorch is much slower than TensorFlow: PyTorch takes 42 minutes while TensorFlow takes only 11 minutes. I based my code on the official PyTorch tutorial and made only minor changes to it.

Could anyone share some advice for this problem?

Here is a summary of what I tried.

environment: Colab Pro+
dataset: CIFAR-10
classifier: VGG16
optimizer: Adam
loss: cross-entropy
batch size: 32

PyTorch
Code:

import torch, torchvision
from torch import nn
from torchvision import transforms, models
from tqdm import tqdm
import time, copy

trans = transforms.Compose([transforms.Resize((224, 224)),
                            transforms.ToTensor(),])

data = {phase: torchvision.datasets.CIFAR10('./', train = (phase=='train'),  transform=trans, download=True) for phase in ['train', 'test']}
dataloaders = {phase: torch.utils.data.DataLoader(data[phase], batch_size=32, shuffle=True) for phase in ['train', 'test']}

def train_model(model, criterion, optimizer, dataloaders, device, num_epochs=5):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in tqdm(iter(dataloaders[phase])):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            # NOTE: len(dataloaders[phase]) is the number of batches, not samples,
            # so these values are inflated by roughly the batch size (hence the
            # Acc values above 1 in the logs below); dividing by
            # len(dataloaders[phase].dataset) would give per-sample metrics.
            epoch_loss = running_loss / len(dataloaders[phase])
            epoch_acc = running_corrects.double() / len(dataloaders[phase])

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = models.vgg16(pretrained=False)
model = model.to(device)

model = train_model(model=model,
                    criterion=nn.CrossEntropyLoss(), 
                    optimizer=torch.optim.Adam(model.parameters(), lr=0.001),
                    dataloaders=dataloaders,
                    device=device,
                    )

Result:

Epoch 0/4
----------
  0%|          | 0/1563 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
100%|██████████| 1563/1563 [07:50<00:00,  3.32it/s]
train Loss: 75.5199 Acc: 3.2809
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.7274 Acc: 3.1949

Epoch 1/4
----------
100%|██████████| 1563/1563 [07:50<00:00,  3.33it/s]
train Loss: 73.8162 Acc: 3.2514
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.6114 Acc: 3.1949

Epoch 2/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7741 Acc: 3.1369
100%|██████████| 313/313 [00:38<00:00,  8.11it/s]
test Loss: 73.5873 Acc: 3.1949

Epoch 3/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7493 Acc: 3.1331
100%|██████████| 313/313 [00:38<00:00,  8.12it/s]
test Loss: 73.6191 Acc: 3.1949

Epoch 4/4
----------
100%|██████████| 1563/1563 [07:49<00:00,  3.33it/s]
train Loss: 73.7289 Acc: 3.1939
100%|██████████| 313/313 [00:38<00:00,  8.13it/s]
test Loss: 73.5955 Acc: 3.1949

Training complete in 42m 22s
Best val Acc: 3.194888

TensorFlow
Code:

import tensorflow_datasets as tfds
from tensorflow.keras import applications, models
import tensorflow as tf
import time

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])

def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))
    image = tf.expand_dims(image,0)
    label = tf.one_hot(label,10)
    label = tf.expand_dims(label,0)
    return (image, label)

ds_train_ = ds_train.map(resize)
ds_test_ = ds_test.map(resize)


model = applications.vgg16.VGG16(input_shape = (224, 224, 3), weights=None, classes=10)
model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics= ['accuracy'])

batch_size = 32
since = time.time()
history = model.fit(ds_train_,
                    batch_size = batch_size,
                    steps_per_epoch = len(ds_train)//batch_size,
                    epochs = 5,
                    validation_steps = len(ds_test),
                    validation_data = ds_test_,
                    shuffle = True,)
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format( time_elapsed // 60, time_elapsed % 60 ))

Result:

Epoch 1/5
1562/1562 [==============================] - 125s 69ms/step - loss: 36.9022 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 2/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3031 - accuracy: 0.1005 - val_loss: 2.3033 - val_accuracy: 0.1000
Epoch 3/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3035 - accuracy: 0.1069 - val_loss: 2.3031 - val_accuracy: 0.1000
Epoch 4/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3038 - accuracy: 0.1024 - val_loss: 2.3030 - val_accuracy: 0.1000
Epoch 5/5
1562/1562 [==============================] - 129s 83ms/step - loss: 2.3028 - accuracy: 0.1024 - val_loss: 2.3033 - val_accuracy: 0.1000
Training complete in 11m 23s

Upvotes: 2

Views: 915

Answers (1)

Laplace Ricky

Reputation: 1687

This is because in your TensorFlow code the data pipeline feeds a batch of 1 image into the model per step, instead of a batch of 32 images.

Passing batch_size to model.fit does not control the batch size when the input is a tf.data.Dataset. The log only showed a seemingly correct number of steps per epoch because you passed steps_per_epoch to model.fit.
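
You can verify this by inspecting the element spec of the question's pipeline. A minimal sketch, reusing the resize function from the question: tf.expand_dims adds a leading dimension of size 1, so every dataset element already looks like a batch of one image, and model.fit consumes it one image per step.

import tensorflow as tf
import tensorflow_datasets as tfds

ds_train = tfds.load('cifar10', split='train')

def resize(ip):
    image = tf.image.resize(ip['image'], (224, 224))
    image = tf.expand_dims(image, 0)   # adds a leading batch dimension of size 1
    label = tf.expand_dims(tf.one_hot(ip['label'], 10), 0)
    return (image, label)

# The leading dimension of 1 is treated as the batch axis:
print(ds_train.map(resize).element_spec)
# (TensorSpec(shape=(1, 224, 224, 3), dtype=tf.float32, name=None),
#  TensorSpec(shape=(1, 10), dtype=tf.float32, name=None))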

To correctly set the batch size:

ds_test, ds_train = tfds.load('cifar10', split=['test', 'train'])

def resize(ip):
    image = ip['image']
    label = ip['label']
    image = tf.image.resize(image, (224, 224))
    label = tf.one_hot(label,10)
    return (image, label)

train_size=len(ds_train)
test_size=len(ds_test)
ds_train_ = ds_train.shuffle(train_size).batch(32).map(resize)
ds_test_ = ds_test.shuffle(test_size).batch(32).map(resize)
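
The same check (not in the original answer) confirms the pipeline is now batched; the leading None dimension is the batch axis:

print(ds_train_.element_spec)
# (TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
#  TensorSpec(shape=(None, 10), dtype=tf.float32, name=None))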

model.fit call:

history = model.fit(ds_train_,
                    epochs = 1,
                    validation_data = ds_test_)
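
As an optional further tweak (not part of the fix itself), the batched dataset can be parallelized and prefetched with standard tf.data options; a minimal sketch, assuming TF 2.4+ for tf.data.AUTOTUNE:

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older versions

ds_train_ = (ds_train.shuffle(train_size)
                     .batch(32)
                     .map(resize, num_parallel_calls=AUTOTUNE)  # parallel preprocessing
                     .prefetch(AUTOTUNE))                       # overlap input with training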

After fixing the problem, TensorFlow achieved speed comparable to PyTorch's. On my machine, PyTorch took ~27 minutes per epoch while TensorFlow took ~24 minutes per epoch.
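
Much of the remaining PyTorch time is likely input-pipeline overhead from the single-process DataLoader in the question. A hedged sketch of the usual knobs (the worker count is an assumption; tune it for your machine):

dataloaders = {
    phase: torch.utils.data.DataLoader(
        data[phase],
        batch_size=32,
        shuffle=(phase == 'train'),
        num_workers=2,    # load and preprocess batches in background processes
        pin_memory=True,  # speeds up host-to-GPU copies when training on CUDA
    )
    for phase in ['train', 'test']
}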

According to benchmarks from NVIDIA, PyTorch and TensorFlow show similar speed in most popular deep learning applications on real-world datasets and problem sizes. (Reference: https://developer.nvidia.com/deep-learning-performance-training-inference)

Upvotes: 2
