Reputation: 11
I'm trying to download a large dataset from Hugging Face in Google Colab, but I keep running into storage issues. Since the dataset is too large for Colab’s local disk, I want to directly store it in my Google Drive, which has enough space.
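Google Drive is already mounted in the notebook before anything else runs (a minimal sketch of the mount step, assuming the standard Colab API):

from google.colab import drive

# Mount Google Drive so it becomes available under /content/drive/MyDrive
drive.mount('/content/drive')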
I've tried the following approaches without success:
import os
from pathlib import Path

from datasets import load_dataset, config

datasets_drive_dir = "/content/drive/MyDrive/my_huggingface"

# Create the target directory on Drive if it doesn't exist yet
if not os.path.isdir(datasets_drive_dir):
    print("Directory doesn't exist - creating it")
    os.mkdir(datasets_drive_dir)

# Point every Hugging Face cache-related environment variable at Drive
os.environ['HF_HOME'] = datasets_drive_dir
os.environ['DOWNLOADED_DATASETS_PATH'] = datasets_drive_dir
os.environ['HF_DATASETS_CACHE'] = datasets_drive_dir
os.environ['HF_CACHE_HOME'] = datasets_drive_dir

# Also override the paths in datasets.config directly
config.DOWNLOADED_DATASETS_PATH = Path(datasets_drive_dir)
config.HF_DATASETS_CACHE = Path(datasets_drive_dir)
config.HF_CACHE_HOME = Path(datasets_drive_dir)

# Finally, pass cache_dir explicitly as well (DATASET_NAME holds the dataset identifier on the Hub)
load_dataset(DATASET_NAME, 'pre', cache_dir=datasets_drive_dir)
Despite these efforts, the dataset still tries to download to Colab’s local storage, and I run out of space :(
I've also tried streaming mode, but I encountered the following error during training:
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error..
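For reference, the streaming attempt looked roughly like this (a minimal sketch; the split name and training loop are simplified placeholders):

from datasets import load_dataset

# streaming=True avoids writing the dataset to disk and fetches examples lazily over HTTP
stream = load_dataset(DATASET_NAME, 'pre', split='train', streaming=True)

for example in stream:
    # training step goes here; the HfHubHTTPError above is raised while iterating
    ...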
Upvotes: 0
Views: 96