Reputation: 461
I'm building an image to train on the AML service and trying to get torchvision==0.3.0 into that image. The notebook VM that I'm using has torchvision 0.3.0 and pytorch 1.1.0, and it lets me do what I'm trying to do... but only on the notebook VM. When I submit the job to AML, I get an error:
Error occurred: module 'torchvision.models' has no attribute 'googlenet'
I've managed to capture the logs from image creation. This is part of the extract, which partially shows what's going on:
Created wheel for dill: filename=dill-0.3.0-cp36-none-any.whl size=77512 sha256=b39463bd613a2337f86181d449e55c84446bb76c2fad462b0ff7ed721872f817
Stored in directory: /root/.cache/pip/wheels/c9/de/a4/a91eec4eea652104d8c81b633f32ead5eb57d1b294eab24167
Successfully built horovod future json-logging-py psutil absl-py pathspec liac-arff dill
Installing collected packages: tqdm, ptvsd, gunicorn, applicationinsights, urllib3, idna, chardet, requests, asn1crypto, cryptography, pyopenssl, isodate, oauthlib, requests-oauthlib, msrest, jsonpickle, azure-common, PyJWT, python-dateutil, adal, msrestazure, azure-mgmt-authorization, azure-mgmt-containerregistry, pyasn1, ndg-httpsclient, pathspec, azure-mgmt-keyvault, websocket-client, docker, contextlib2, azure-mgmt-resource, backports.weakref, backports.tempfile, jeepney, SecretStorage, pytz, azure-mgmt-storage, ruamel.yaml, azure-graphrbac, jmespath, azureml-core, configparser, json-logging-py, werkzeug, click, MarkupSafe, Jinja2, itsdangerous, flask,liac-arff, pandas, dill, azureml-model-management-sdk, azureml-defaults, torchvision, cloudpickle, psutil, horovod, markdown, protobuf, grpcio, absl-py, tensorboard, future
Found existing installation: torchvision 0.3.0
Uninstalling torchvision-0.3.0:
Successfully uninstalled torchvision-0.3.0
Successfully installed Jinja2-2.10.1 MarkupSafe-1.1.1 PyJWT-1.7.1 SecretStorage-3.1.1 absl-py-0.7.1 adal-1.2.2 applicationinsights-0.11.9 asn1crypto-0.24.0 azure-common-1.1.23 azure-graphrbac-0.61.1 azure-mgmt-authorization-0.60.0 azure-mgmt-containerregistry-2.8.0 azure-mgmt-keyvault-2.0.0 azure-mgmt-resource-3.1.0 azure-mgmt-storage-4.0.0 azureml-core-1.0.55 azureml-defaults-1.0.55 azureml-model-management-sdk-1.0.1b6.post1 backports.tempfile-1.0 backports.weakref-1.0.post1 chardet-3.0.4 click-7.0 cloudpickle-1.2.1 configparser-3.7.4 contextlib2-0.5.5 cryptography-2.7 dill-0.3.0 docker-4.0.2 flask-1.0.3 future-0.17.1 grpcio-1.22.0 gunicorn-19.9.0 horovod-0.16.1 idna-2.8 isodate-0.6.0 itsdangerous-1.1.0 jeepney-0.4.1 jmespath-0.9.4 json-logging-py-0.2 jsonpickle-1.2 liac-arff-2.4.0 markdown-3.1.1 msrest-0.6.9 msrestazure-0.6.1 ndg-httpsclient-0.5.1 oauthlib-3.1.0 pandas-0.25.0 pathspec-0.5.9 protobuf-3.9.1 psutil-5.6.3 ptvsd-4.3.2 pyasn1-0.4.6 pyopenssl-19.0.0 python-dateutil-2.8.0 pytz-2019.2 requests-2.22.0 requests-oauthlib-1.2.0 ruamel.yaml-0.15.89 tensorboard-1.14.0 torchvision-0.2.1 tqdm-4.33.0 urllib3-1.25.3 websocket-client-0.56.0 werkzeug-0.15.5
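In a build log this long, the downgrade is easy to miss. A minimal sketch of one way to pull the final version of a package out of pip's "Successfully installed" line follows. final_versions is a hypothetical helper, not part of pip or AML, and it assumes the name-version token format seen in the log above.

```python
def final_versions(log_text, package):
    """Collect the versions pip reports for `package` on 'Successfully installed' lines."""
    versions = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("Successfully installed"):
            continue
        # Tokens look like 'torchvision-0.2.1'; the version follows the last hyphen.
        for token in line.split()[2:]:
            name, _, version = token.rpartition("-")
            if name.lower() == package.lower():
                versions.append(version)
    return versions


if __name__ == "__main__":
    # Shortened sample taken from the log lines quoted above.
    sample = (
        "Found existing installation: torchvision 0.3.0\n"
        "Successfully uninstalled torchvision-0.3.0\n"
        "Successfully installed tensorboard-1.14.0 torchvision-0.2.1 tqdm-4.33.0\n"
    )
    print(final_versions(sample, "torchvision"))
```

Run against the extract above, this would report 0.2.1 for torchvision, which matches the missing googlenet attribute.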
I tried debugging the image creation process by looking into the logs, which is where I captured what's shown above. I've also tried attaching a Python debugger to the running processes, and opening a bash shell inside the running Docker container to use an interactive Python session and see what my problem is. The original problem is that I can't use torchvision.models.googlenet, as it isn't present in the version in use.
Without going into too much detail, here's the code that I use to create the estimator and then submit the job. Nothing particularly fancy.
from azureml.train.dnn import PyTorch

conda_packages = ['pytorch', 'scikit-learn', 'torchvision==0.3.0']
pip_packages = ['tqdm', 'ptvsd']
and I create my estimator with this:
# ct (compute target) and script_params are defined elsewhere
pyTorchEstimator = PyTorch(source_directory='./aml-image-models',
                           compute_target=ct,
                           entry_script='train_network.py',
                           script_params=script_params,
                           node_count=1,
                           process_count_per_node=1,
                           conda_packages=conda_packages,
                           pip_packages=pip_packages,
                           use_gpu=True,
                           framework_version='1.1')
and submit with typical code.
Given that I'm specifying 0.3.0 in the dependencies, I'd expect it to just work.
Thoughts?
Upvotes: 0
Views: 112
Reputation: 21
torchvision 0.2.1 is pre-configured in the PyTorch estimator for torch versions 1.0 and 1.1, which is why the image build replaces the 0.3.0 you requested with 0.2.1. https://learn.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.pytorch?view=azure-ml-py#remarks
However, you can still override the torchvision version after the estimator is initialized.
estimator.conda_dependencies.add_pip_package('torchvision==0.3.0')
Another option is to use the generic Estimator, if you are sure about the dependencies you need.
from azureml.train.estimator import Estimator

conda_packages = ['pytorch', 'scikit-learn', 'torchvision==0.3.0']
pip_packages = ['tqdm', 'ptvsd']
# ct (compute target) and script_params are the same objects as in the question
estimator = Estimator(source_directory='./aml-image-models',
                      compute_target=ct,
                      entry_script='train_network.py',
                      script_params=script_params,
                      conda_packages=conda_packages,
                      pip_packages=pip_packages,
                      use_gpu=True)
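Whichever option you pick, it may be worth failing fast inside the run if the image still ends up with the older torchvision. Below is a minimal sketch of such a check, using only the standard library. parse_version, installed_version and require are hypothetical helper names, not part of the AML SDK, and the simple numeric comparison assumes plain dotted versions such as 0.3.0.

```python
import importlib


def parse_version(text):
    """Turn '0.3.0' or '1.1.0+cu100' into a tuple of ints for comparison."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def installed_version(module_name):
    """Return module.__version__, or None if the module is missing or has no version."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def require(module_name, minimum):
    """Raise RuntimeError unless module_name is importable at version >= minimum."""
    found = installed_version(module_name)
    if found is None or parse_version(found) < parse_version(minimum):
        raise RuntimeError(
            "%s>=%s required, found %s" % (module_name, minimum, found)
        )
    return found
```

Calling require('torchvision', '0.3.0') at the top of train_network.py would then stop the run with a clear message instead of the later "has no attribute 'googlenet'" error.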
Upvotes: 1