Franck Dernoncourt

Reputation: 83387

How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?

I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:

pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --local-dir-use-symlinks False 

However, the downloaded files don't keep their original filenames. Instead, their hashes (the git hash, or the SHA-256 for LFS files) are used as filenames:

--- /home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/blobs ---------------------------------------------
                         /..                                                                                                       
   12.9 GiB [##########]  b581945ddee5e673fa2059afb25274b1523f270687b5253cb8aa72865760ebc0
    3.9 GiB [###       ]  86ebd2861a42b27168d75f346dd72f0e2b9eaee0afb90890beff15d025af45c6
    3.9 GiB [##        ]  f9b81739ee30450b930390e1155e2cdea1b3063379ba6fd9253513eba1ab1e05
    3.7 GiB [##        ]  e54c7d123ad93f4144eebdca2827ef81ea1ac282ddd2243386528cd157c02f36
    3.7 GiB [##        ]  736e225a7dd38a7987d0745b1b2f545ab701cfdf1f639874f5743b5bfb5cb1e1
    3.7 GiB [##        ]  0687246c92ec87b54e1c5fe623a77b650c02e6884e17a6f0fb4052a862d928d0
    3.6 GiB [##        ]  2becb5f9878b95f1b12622f50868f5855221985f05910d7cc759e6be074e6b8e
    3.5 GiB [##        ]  2208068c69b39c46ee9fac862da3c060c58b61adcaee1b3e6aa5d6d5dd3eba86
    3.5 GiB [##        ]  caf87e71232cbb8a31960a26ba30b9412c15893c831ef118196c581cfd3a3779
    3.4 GiB [##        ]  dc88cbf0ef45351bdc1f53c4396466d3e79874803719e266630ed6c3ad911d6a
    3.4 GiB [##        ]  f05f7fb3b55b6840ebc4ada5daa28742bbae6ad4dcc35781dc811024f27a1b4e
    3.4 GiB [##        ]  88bd831618b36330ef5cd84b7ccbc4d5f3f55955c0b223208bc2244b27fb2d78
    3.4 GiB [##        ]  bf80943b3389ddbeb8fb8a56af2d7fa5d09c5af076aac93f54ad921ee382c77d
    3.3 GiB [##        ]  83b2627e644c9ad0486e3bd966b02f014722e668d26b9d52394c974fcf2fdcf8
    3.2 GiB [##        ]  e52e7b086dabd431b25cf309e1fe513190543e058f4e7a2d8e05b22821ded4fe
    3.2 GiB [##        ]  4fe583348f3ac118f34c7b93b6a187ba4e21a5a7f5b6ca1a6adbce1cc6d563a9
    3.2 GiB [##        ]  ae6b6faca3bbd75e7ca99ccf20b55b017393bf09022efb8459293afffe06dc6e
    3.1 GiB [##        ]  5865379a894f8dc40703bdc1093d45fda67d5e1a742a2eebddd37e1a00f067fd
    3.1 GiB [##        ]  cd346324b29390a589926ccab7187ae818cf5f9fcbaf8ecc95313e6cdfab86bc
    3.0 GiB [##        ]  914eb2b1174a662e3faebac82f6b5591a54def39a9d3a7e5ab2347ecc87a982f
    2.9 GiB [##        ]  24789f33332e8539b2ee72a0a489c0f4d0c6103f7f9600de660d78543ade9111
    2.9 GiB [##        ]  35e8da5f831b36416c9569014c58f881a0a30c00db9f3caae0d7db6a8fd3c694
    2.8 GiB [##        ]  d5127e0298661d40a343d58759ed6298f9d2ef02d5c4f6a30bd9e07bc5423317
    2.8 GiB [##        ]  1b4e1951da2462ca77d94d220a58c97f64caa2b2defe4df95feed9defcee6ca7
    2.8 GiB [##        ]  75a4725625c095d98ecef7d68d384d7b1201ace046ef02ed499776b0ac02b61e
    2.8 GiB [##        ]  fefbbc3e87be522b7e571c78a188aba35bd5d282cf8f41257097a621af64ff60
 Total disk usage: 184.8 GiB  Apparent size: 184.8 GiB  Items: 85                                          

How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?

Upvotes: 1

Views: 2547

Answers (3)

yaner

Reputation: 77

cp -L follows symbolic links when copying, so the destination gets the real file contents under the original names. Since the snapshot is a directory, add -r: cp -rL ~/.cache/huggingface/hub/<YOUR_REPO>/snapshots/<HASH> <DESTINATION_PATH>. You can also pass --reflink=auto to avoid the extra disk usage of a full copy on filesystems that support copy-on-write (e.g. Btrfs or XFS).
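For example, a minimal sketch assuming the MeetingBank_Audio dataset from the question sits in the default cache location (<HASH> stands for the snapshot directory name, and the destination path is illustrative):

cp -rL --reflink=auto ~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots/<HASH> ~/MeetingBank_Audio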

Upvotes: 0

SWHL

Reputation: 76

I ran into the same problem and wrote a Python script to handle it.

For example, I download the naver-clova-ix/synthdog-en dataset by:

$ huggingface-cli download --repo-type dataset --resume-download naver-clova-ix/synthdog-en --local-dir synthdog-en

The synthdog-en directory structure is as follows:

synthdog-en
├── README.md
├── data
│   ├── train-00000-of-00084-26dbc51f3d0903b9.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/9d0260e08cb5a4f9c14fa794465bcb66fae6ef7ccc2f6d7ef20efa44810c0648
│   ├── train-00001-of-00084-3efa94914043c815.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/04441e203ff713743c0c9a1009f71f97e47bc4d7b2c9313f4fcfa9c3e73b20e3
│   ├── ...
│   └── validation-00000-of-00001-394e0bd4c5ebec42.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/4e5f27b7a976041855d80eb07680de4ea014be07a494f40b246058dfce46d44b
└── dataset_infos.json

The full Python script is as follows:

import shutil
from pathlib import Path

from tqdm import tqdm


def cp_symlink_file_to_dst(file_path: Path, dst_dir: Path):
    # Skip anything that is already a regular file.
    if not file_path.is_symlink():
        return

    # The link target is relative, e.g.
    # ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/<sha256>;
    # strip the leading "../" chain and resolve it against the home directory.
    real_file_path = file_path.readlink()  # Path.readlink() requires Python 3.9+
    real_file_path = Path.home() / str(real_file_path).rpartition("../")[-1]

    # Keep the human-readable name of the symlink itself.
    real_file_name = file_path.name

    dst_file_path = Path(dst_dir) / real_file_name

    shutil.copy(real_file_path, dst_file_path)


if __name__ == "__main__":
    # Run this from inside the synthdog-en directory downloaded above.
    data_dir = Path("data")
    data_paths = list(data_dir.glob("*.parquet"))

    dst_dir = Path("output")
    dst_dir.mkdir(parents=True, exist_ok=True)
    for file_path in tqdm(data_paths):
        cp_symlink_file_to_dst(file_path, dst_dir)
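A note on the design: the manual "../" stripping assumes the Hugging Face cache lives under the home directory. A minimal alternative sketch (same behaviour, but letting pathlib resolve the link wherever the cache actually is):

import shutil
from pathlib import Path


def cp_symlink_file_to_dst(file_path: Path, dst_dir: Path):
    # Path.resolve() follows the symlink chain, so no assumption
    # about the cache location is needed.
    if file_path.is_symlink():
        shutil.copy(file_path.resolve(), Path(dst_dir) / file_path.name)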

The output directory is as follows:

output
├── train-00000-of-00084-26dbc51f3d0903b9.parquet
├── train-00001-of-00084-3efa94914043c815.parquet
├── ...
├── train-00083-of-00084-5e6bb79e23f90f3b.parquet
└── validation-00000-of-00001-394e0bd4c5ebec42.parquet

Upvotes: 1

Franck Dernoncourt

Reputation: 83387

One has to look at the snapshots folder:

/home/username/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots

It contains the original, readable filenames. However, the files there are symlinks pointing to the blob files whose names are hashes. Replacing these symbolic links with the actual files (stored in the blobs folder) yields the original files under their original filenames.

To replace the symbolic links with the actual files on Linux, one can use u1686_grawity's script:

script.sh:

#!/bin/sh
# Replace each symlink passed as an argument with a copy of its target.
set -e
for link; do
    # Only process symbolic links.
    test -h "$link" || continue

    dir=$(dirname "$link")
    reltarget=$(readlink "$link")
    # Turn a relative link target into an absolute path.
    case $reltarget in
        /*) abstarget=$reltarget;;
        *)  abstarget=$dir/$reltarget;;
    esac

    rm -fv "$link"
    cp -afv "$abstarget" "$link" || {
        # on failure, restore the symlink
        rm -rfv "$link"
        ln -sfv "$reltarget" "$link"
    }
done

To run:

find . -type l -exec /path/to/script.sh {} +
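For instance, assuming the script was saved as /path/to/script.sh, one could make it executable and run it from the MeetingBank_Audio snapshots folder mentioned above:

chmod +x /path/to/script.sh
cd ~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots
find . -type l -exec /path/to/script.sh {} +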

Result:

(base) dernoncourt@server:~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots$ tree --du -h
.
└── [255G]  27779a666ff5fd879f4c5567489ff47e82364abd
    ├── [ 29G]  Alameda
    │   ├── [499M]  Alameda-transcripts-videolist.zip
    │   └── [ 29G]  mp3
    │       ├── [3.7G]  alameda-1.zip
    │       ├── [3.5G]  alameda-2.zip
    │       ├── [3.9G]  alameda-3.zip
    │       ├── [3.7G]  alameda-4.zip
    │       ├── [3.3G]  alameda-5.zip
    │       ├── [3.4G]  alameda-6.zip
    │       ├── [2.9G]  alameda-7.zip
    │       ├── [3.0G]  alameda-8.zip
    │       └── [1.2G]  alameda-9.zip
    ├── [3.9G]  Boston
    │   ├── [3.2K]  boston_video_list.txt
    │   ├── [3.9G]  mp3
    │   │   └── [3.9G]  Boston.zip
    │   └── [8.1M]  transcripts
    │       └── [8.1M]  transcripts.zip
    ├── [ 54G]  Denver
    │   ├── [650M]  Denver-transcripts-videolist.zip
    │   └── [ 53G]  mp3
    │       ├── [2.0G]  Denver-1.zip
    │       ├── [3.2G]  Denver-10.zip
    │       ├── [2.5G]  Denver-11.zip
    │       ├── [2.7G]  Denver-12.zip
    │       ├── [1.5G]  Denver-13.zip
    │       ├── [2.5G]  Denver-14.zip
[...]

Upvotes: 0
