Reputation: 371
I followed this guide to launch my PyTorch Lightning project on Google Colab TPU. So I installed
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
Then
!pip install pytorch-lightning
Then I installed the project's dependencies:
!pip install torch torchvision torchaudio
!pip install -r requirements.txt
After installing the project requirements, I restarted the runtime as requested and re-ran the cloud-tpu-client install, the pytorch-lightning install, and both commands above. Everything ran smoothly.
But just after the TPU started up with PyTorch 1.9, I got the following import error:
WARNING:root:TPU has started up successfully with version pytorch-1.9
Traceback (most recent call last):
File "synthesizer_train.py", line 2, in <module>
from synthesizer.train import train
File "/content/Real-Time-Voice-Cloning/synthesizer/train.py", line 6, in <module>
from synthesizer.models.tacotron import Tacotron
File "/content/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 7, in <module>
import pytorch_lightning as pl
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
from pytorch_lightning.callbacks import Callback # noqa: E402
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
from pytorch_lightning.callbacks.base import Callback
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
from pytorch_lightning.utilities.types import STEP_OUTPUT
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/imports.py", line 101, in <module>
from pytorch_lightning.utilities.xla_device import XLADeviceUtils # noqa: E402
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/xla_device.py", line 24, in <module>
import torch_xla.core.xla_model as xm
File "/usr/local/lib/python3.7/dist-packages/torch_xla/__init__.py", line 142, in <module>
import _XLAC
ImportError: /usr/local/lib/python3.7/dist-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN2at13_foreach_erf_EN3c108ArrayRefINS_6TensorEEE
The Trainer was launched with the flag tpu_cores=8.
The model had previously run on CPU and GPU (i.e. in another session).
I tried downgrading PyTorch to 1.9 (the version shown when the TPU starts), because Colab ships torch 1.10.0+cu111, and a different error appeared:
WARNING:root:TPU has started up successfully with version pytorch-1.9
Traceback (most recent call last):
File "synthesizer_train.py", line 2, in <module>
from synthesizer.train import train
File "/content/Real-Time-Voice-Cloning/synthesizer/train.py", line 6, in <module>
from synthesizer.models.tacotron import Tacotron
File "/content/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 7, in <module>
import pytorch_lightning as pl
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
from pytorch_lightning.callbacks import Callback # noqa: E402
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
from pytorch_lightning.callbacks.base import Callback
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
from pytorch_lightning.utilities.types import STEP_OUTPUT
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 29, in <module>
if _compare_version("torchtext", operator.ge, "0.9.0"):
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/imports.py", line 54, in _compare_version
pkg = importlib.import_module(package)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/usr/local/lib/python3.7/dist-packages/torchtext/__init__.py", line 5, in <module>
from . import vocab
File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab/__init__.py", line 11, in <module>
from .vocab_factory import (
File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab/vocab_factory.py", line 4, in <module>
from torchtext._torchtext import (
ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZTVN5torch3jit6MethodE
Is there anything I can do to train the model on TPU?
Thank you very much.
Upvotes: 1
Views: 5648
Reputation: 278
Building on the solution above, you can make the fix more reliable by first checking which CUDA version your installed torch build uses:
import torch
torch.version.cuda
10.2
Based on this CUDA version, run the matching pip install command:
!pip install cloud-tpu-client==0.10 torchvision==0.12.0+cu102 torch==1.11.0+cu102 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl -f https://download.pytorch.org/whl/cu102/torch_stable.html
Please note that cu102 appears in three places in the above command.
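As a rough sketch of how the three cu102 occurrences follow from the CUDA version, the command can be assembled from the value of torch.version.cuda. The helper names cuda_tag and build_install_command are hypothetical, and the defaults are simply the versions from the command above; this assumes wheels actually exist for the resulting tag, which is not guaranteed for every CUDA version.

```python
# Wheel URL copied from the install command above.
XLA_WHEEL = (
    "https://storage.googleapis.com/tpu-pytorch/wheels/colab/"
    "torch_xla-1.11-cp37-cp37m-linux_x86_64.whl"
)


def cuda_tag(cuda_version):
    """Turn a CUDA version such as '10.2' into a wheel tag such as 'cu102'."""
    if cuda_version is None:
        # torch.version.cuda is None on CPU-only builds of torch.
        raise ValueError("this torch build reports no CUDA version")
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def build_install_command(cuda_version, torch_version="1.11.0",
                          torchvision_version="0.12.0", xla_wheel=XLA_WHEEL):
    """Assemble the pip command; the tag is used in three places."""
    tag = cuda_tag(cuda_version)
    parts = [
        "pip install",
        "cloud-tpu-client==0.10",
        f"torchvision=={torchvision_version}+{tag}",
        f"torch=={torch_version}+{tag}",
        xla_wheel,
        "-f",
        f"https://download.pytorch.org/whl/{tag}/torch_stable.html",
    ]
    return " ".join(parts)
```

In a notebook you would call build_install_command(torch.version.cuda), print the result, and run it with a leading !.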
Upvotes: 1
Reputation: 371
Actually, the same problem has already been described, and the suggested solution worked for me.
In the details, they suggest downgrading PyTorch to 1.9.0+cu111 (mind the +cu111) after installing torch_xla.
Consequently, here are the steps I followed to launch my Lightning project on Google Colab with TPU:
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchtext==0.10.0 -f https://download.pytorch.org/whl/cu111/torch_stable.html
And then the project's pip installs:
!pip install torch torchvision torchaudio pytorch-lightning
!pip install -r requirements.txt
It worked, although I had to restart the runtime after this last step.
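For anyone who wants to check for the mismatch before importing pytorch_lightning, here is a minimal sketch that compares the major.minor release of two version strings. The function names major_minor and same_release are my own, and treating a major.minor match as sufficient is an assumption based on the errors above (torch 1.10 against a torch_xla 1.9 wheel), not a documented guarantee.

```python
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '1.9.0+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release(torch_version, xla_version):
    """True when both version strings share the same major.minor release."""
    return major_minor(torch_version) == major_minor(xla_version)
```

In Colab I would pass torch.__version__ as the first argument and the version from the torch_xla wheel file name (1.9 here) as the second, since importing torch_xla itself is what fails when the two do not match.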
Upvotes: 1