Torchvision 0.3.0 for training a model on AML service

Question

I'm building an image to train on AML service and trying to get torchvision==0.3.0 onto that image. The notebook VM that I'm using has torchvision 0.3.0 and pytorch 1.1.0, and it lets me do what I'm trying to do... but only on the notebook VM. When I submit the job to AML, I get an error:

Error occurred: module 'torchvision.models' has no attribute 'googlenet'

I've managed to capture the logs at image creation. This is part of the extract, which partially shows what's going on:

  Created wheel for dill: filename=dill-0.3.0-cp36-none-any.whl size=77512 sha256=b39463bd613a2337f86181d449e55c84446bb76c2fad462b0ff7ed721872f817

  Stored in directory: /root/.cache/pip/wheels/c9/de/a4/a91eec4eea652104d8c81b633f32ead5eb57d1b294eab24167

Successfully built horovod future json-logging-py psutil absl-py pathspec liac-arff dill

Installing collected packages: tqdm, ptvsd, gunicorn, applicationinsights, urllib3, idna, chardet, requests, asn1crypto, cryptography, pyopenssl, isodate, oauthlib, requests-oauthlib, msrest, jsonpickle, azure-common, PyJWT, python-dateutil, adal, msrestazure, azure-mgmt-authorization, azure-mgmt-containerregistry, pyasn1, ndg-httpsclient, pathspec, azure-mgmt-keyvault, websocket-client, docker, contextlib2, azure-mgmt-resource, backports.weakref, backports.tempfile, jeepney, SecretStorage, pytz, azure-mgmt-storage, ruamel.yaml, azure-graphrbac, jmespath, azureml-core, configparser, json-logging-py, werkzeug, click, MarkupSafe, Jinja2, itsdangerous, flask,liac-arff, pandas, dill, azureml-model-management-sdk, azureml-defaults, torchvision, cloudpickle, psutil, horovod, markdown, protobuf, grpcio, absl-py, tensorboard, future

  Found existing installation: torchvision 0.3.0

    Uninstalling torchvision-0.3.0:

      Successfully uninstalled torchvision-0.3.0

Successfully installed Jinja2-2.10.1 MarkupSafe-1.1.1 PyJWT-1.7.1 SecretStorage-3.1.1 absl-py-0.7.1 adal-1.2.2 applicationinsights-0.11.9 asn1crypto-0.24.0 azure-common-1.1.23 azure-graphrbac-0.61.1 azure-mgmt-authorization-0.60.0 azure-mgmt-containerregistry-2.8.0 azure-mgmt-keyvault-2.0.0 azure-mgmt-resource-3.1.0 azure-mgmt-storage-4.0.0 azureml-core-1.0.55 azureml-defaults-1.0.55 azureml-model-management-sdk-1.0.1b6.post1 backports.tempfile-1.0 backports.weakref-1.0.post1 chardet-3.0.4 click-7.0 cloudpickle-1.2.1 configparser-3.7.4 contextlib2-0.5.5 cryptography-2.7 dill-0.3.0 docker-4.0.2 flask-1.0.3 future-0.17.1 grpcio-1.22.0 gunicorn-19.9.0 horovod-0.16.1 idna-2.8 isodate-0.6.0 itsdangerous-1.1.0 jeepney-0.4.1 jmespath-0.9.4 json-logging-py-0.2 jsonpickle-1.2 liac-arff-2.4.0 markdown-3.1.1 msrest-0.6.9 msrestazure-0.6.1 ndg-httpsclient-0.5.1 oauthlib-3.1.0 pandas-0.25.0 pathspec-0.5.9 protobuf-3.9.1 psutil-5.6.3 ptvsd-4.3.2 pyasn1-0.4.6 pyopenssl-19.0.0 python-dateutil-2.8.0 pytz-2019.2 requests-2.22.0 requests-oauthlib-1.2.0 ruamel.yaml-0.15.89 tensorboard-1.14.0 torchvision-0.2.1 tqdm-4.33.0 urllib3-1.25.3 websocket-client-0.56.0 werkzeug-0.15.5
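
For reference, here is a small sketch of how the resolved version can be pulled out of a pip log line like the last one above. `installed_versions` is a hypothetical helper (not part of pip or the AML SDK), and the sample string is abbreviated from the log; it assumes each entry is printed as name-version.

```python
def installed_versions(log_line):
    """Map package name -> version from a pip 'Successfully installed ...' line."""
    prefix = "Successfully installed "
    if not log_line.startswith(prefix):
        return {}
    versions = {}
    for token in log_line[len(prefix):].split():
        # Each entry looks like <name>-<version>; the version follows the last hyphen.
        name, _, version = token.rpartition("-")
        if name:
            versions[name] = version
    return versions


# Abbreviated from the image-build log above.
sample = "Successfully installed tensorboard-1.14.0 torchvision-0.2.1 tqdm-4.33.0"
print(installed_versions(sample).get("torchvision"))  # prints 0.2.1
```

Run against the full log line, this reports torchvision 0.2.1, not the 0.3.0 that was requested.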

Without going into too much detail, the code that I use to create the estimator and then submit the job appears further down. Nothing particularly fancy.

I tried debugging the image creation process by looking into the logs, which is where I captured the output shown above. I've also tried attaching a Python debugger to the running processes, and opening a bash shell inside the running Docker container to start an interactive Python session and see what my problem is. The underlying problem is that I can't use torchvision.models.googlenet, because it isn't present in the torchvision version that is actually in use.
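
A minimal sketch of the kind of check that can be run in that interactive session, or placed at the top of the entry script, to see which torchvision the job actually imports. The helper names `module_info` and `has_attr` are hypothetical, and the sketch only assumes the package exposes `__version__` and `__file__`.

```python
import importlib


def module_info(name):
    """Return the version and file path of an importable module, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return {
        "version": getattr(module, "__version__", "unknown"),
        "path": getattr(module, "__file__", "unknown"),
    }


def has_attr(module_name, attr):
    """Return True if the module imports and exposes the given attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Shows which torchvision is picked up and whether googlenet is available in it.
    print("torchvision:", module_info("torchvision"))
    print("has googlenet:", has_attr("torchvision.models", "googlenet"))
```

Comparing this output between the notebook VM and the submitted run should show whether the two environments really import the same torchvision.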

I declare the dependencies like this:

conda_packages = ['pytorch', 'scikit-learn', 'torchvision==0.3.0']
pip_packages = ['tqdm', 'ptvsd']

and I create my estimator with this:

from azureml.train.dnn import PyTorch

# ct (the compute target) and script_params are defined elsewhere
pyTorchEstimator = PyTorch(source_directory='./aml-image-models',
                           compute_target=ct,
                           entry_script='train_network.py',
                           script_params=script_params,
                           node_count=1,
                           process_count_per_node=1,
                           conda_packages=conda_packages,
                           pip_packages=pip_packages,
                           use_gpu=True,
                           framework_version='1.1')

and then submit it with the usual code.

Given that I'm specifying 0.3.0 in the dependencies, I'd expect it to just work.

Thoughts?
