asaf levi

Reputation: 11

Unable to Download Large Hugging Face Dataset to Google Drive in Colab

I'm trying to download a large dataset from Hugging Face in Google Colab, but I keep running into storage issues. The dataset is too large for Colab's local disk, so I want to store it directly in my Google Drive, which has enough space.
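For context, Drive is mounted the standard way before anything else (a minimal sketch of my setup, assuming the default /content/drive mount point):

from google.colab import drive

# Mount Google Drive at Colab's default mount point
drive.mount('/content/drive')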

I've tried the following approaches without success:

  1. Set environment variables like HF_HOME, DOWNLOADED_DATASETS_PATH, HF_DATASETS_CACHE, and HF_CACHE_HOME to point to a Google Drive directory:
import os
from pathlib import Path
from datasets import config

datasets_drive_dir = "/content/drive/MyDrive/my_huggingface"

# Create the target directory on Drive if it doesn't exist yet
if not os.path.isdir(datasets_drive_dir):
    print("Directory doesn't exist - creating it")
    os.mkdir(datasets_drive_dir)

# Point the Hugging Face cache-related environment variables at Drive
os.environ['HF_HOME'] = datasets_drive_dir
os.environ['DOWNLOADED_DATASETS_PATH'] = datasets_drive_dir
os.environ['HF_DATASETS_CACHE'] = datasets_drive_dir
os.environ['HF_CACHE_HOME'] = datasets_drive_dir

# Also override the paths on the datasets config module directly
config.DOWNLOADED_DATASETS_PATH = Path(datasets_drive_dir)
config.HF_DATASETS_CACHE = Path(datasets_drive_dir)
config.HF_CACHE_HOME = Path(datasets_drive_dir)
  2. Changed the cache directory by setting cache_dir in the call itself: load_dataset(DATASET_NAME, 'pre', cache_dir=datasets_drive_dir) (a fuller sketch follows below).
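For completeness, approach 2 in full looks roughly like this (DATASET_NAME stands in for my actual dataset repo id on the Hub, and 'pre' is its configuration name):

from datasets import load_dataset

DATASET_NAME = "some-org/some-dataset"  # placeholder - my real dataset id omitted
datasets_drive_dir = "/content/drive/MyDrive/my_huggingface"

# cache_dir is supposed to redirect the download cache to Drive instead of local disk
dataset = load_dataset(DATASET_NAME, 'pre', cache_dir=datasets_drive_dir)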

Despite these efforts, the dataset still tries to download to Colab’s local storage, and I run out of space :(

I also tried streaming mode, but I encountered the following error during training:

huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error..
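For reference, the streaming attempt was roughly this (same placeholder dataset name as above):

from datasets import load_dataset

# streaming=True returns an iterable dataset, so nothing is cached to disk
streamed = load_dataset(DATASET_NAME, 'pre', streaming=True)

for example in streamed['train']:  # the 500 error appeared while iterating during training
    ...  # training step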

Upvotes: 0

Views: 96

Answers (0)
