Jacques Thibodeau

Reputation: 959

How to delete a temp folder in Google Cloud (TPU) VM?

I'm following the mesh-transformer fine-tuning repo to fine-tune GPT-J. I previously fine-tuned a model on a Google Cloud TPU VM and then deleted the fine-tuned model. Now I'm trying to fine-tune a new model on the same VM, but the code tries to open a tpu_logs file and reports that it doesn't have permission. I suspect this may be because the file was deleted, but I'm not sure. Has anyone encountered this problem before?

Many copies of this message appear in the terminal: Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied

It looks as if something in the codebase is aware that I've used the VM for fine-tuning before and is (maybe?) trying to continue where it left off.

Any thoughts on how to resolve this issue?

The entire output looks like:

.
.
.
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
Could not open any log file.
jax devices: 8
jax runtime initialized in 3.53237s
`--tune_model_path` passed: we are beginning a fine-tuning run
path to load checkpoint from: gs://[my-bucket]/step_383500/
setting up datasets
initializing network
/home/[my-username]/miniconda3/envs/finetune-env/lib/python3.8/site-packages/jax/experimental/maps.py:527: UserWarning: xmap is an experimental feature and probably has bugs!
  warn("xmap is an experimental feature and probably has bugs!")
/home/[my-username]/miniconda3/envs/finetune-env/lib/python3.8/site-packages/jax/_src/lib/xla_bridge.py:429: UserWarning: jax.host_count has been renamed to jax.process_count. This alias will eventually be removed; please update your code.
  warnings.warn(
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
.
.
.
Could not open any log file.
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
Could not open any log file.
/home/[my-username]/miniconda3/envs/finetune-env/lib/python3.8/site-packages/jax/_src/lib/xla_bridge.py:416: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
  warnings.warn(
key shape (8, 2)
in shape (1, 2048)
dp 1
mp 8
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
Could not open any log file.
.
.
.
Could not open any log file.
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
Could not open any log file.
Total parameters: 6053381344
loading network
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
Could not open any log file.
Could not open the log file '/tmp/tpu_logs/tpu_driver.[vm-code].[my-account].log.INFO.20220623-164754.59430': Permission denied
.
.
.

EDIT: The log file permission error seems to pop up when I try to run import tensorflow as tf.
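
One way to narrow this down is to check who owns /tmp/tpu_logs and whether the current user can write to it. The snippet below is only a diagnostic sketch using the Python standard library. It assumes, without confirmation from the output above, that the directory was created by a different user (for example during an earlier run under sudo or another account). The helper name describe_dir is made up for illustration.

```python
import os
import pwd
import stat


def describe_dir(path):
    """Report owner, mode string, and writability of `path` for the current user.

    Hypothetical helper for illustration; not part of any TPU tooling.
    """
    if not os.path.isdir(path):
        return {"exists": False, "owner": None, "mode": None, "writable": False}
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no entry in the password database; fall back to the number.
        owner = str(st.st_uid)
    return {
        "exists": True,
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        # Creating files in a directory needs both write and execute permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Path taken from the error message above.
    print(describe_dir("/tmp/tpu_logs"))
```

If the reported owner is not the account running the fine-tuning script and writable is False, that would be consistent with the "Permission denied" lines. Removing the directory or changing its ownership would then need the owning account or root.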

Upvotes: 2

Views: 396

Answers (1)

Eduardo Ortiz

Reputation: 759

Judging from the console output you shared, this might be happening because the account you are currently using doesn't have the proper permissions (Storage Legacy > Storage Legacy Bucket Writer and Storage Legacy > Storage Legacy Bucket Reader).

I can also see that you are using JAX with your Cloud TPU. For better control of memory usage with JAX, you can refer to this troubleshooting guide.

Additionally, here's some documentation that could help if your disks are full: you can either resize them or delete some files from them.
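
To check whether a full disk is actually a factor before resizing or deleting anything, a quick look at disk usage may help. This is a minimal sketch using only the standard library. The function names and the choice of "/" and "/tmp" as paths to inspect are assumptions for illustration.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem containing `path` that is free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def report(path="/"):
    """Return a one-line human-readable summary of disk usage for `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return (
        f"{path}: {usage.used / gib:.1f} GiB used of "
        f"{usage.total / gib:.1f} GiB ({usage.free / gib:.1f} GiB free)"
    )


if __name__ == "__main__":
    # Paths chosen for illustration; /tmp is where the log directory lives.
    for p in ("/", "/tmp"):
        print(report(p))
```

If plenty of space is free, a full disk is probably not what causes the "Permission denied" messages.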

Upvotes: 1
